AI for Scientific Discovery: Three Milestones and a Benchmark Map

This post traces the three milestones on the path from "LLMs as conversational assistants" to "LLMs as scientific discovery engines": FunSearch (DeepMind 2023) which made the basic loop work on cap sets and bin packing, AlphaEvolve (DeepMind 2025) which scaled it to Google production, and TTT-Discover (Yuksekgonul, Koceja, Sun et al., 2026) which adds the missing piece — training the LLM at inference time. The latter beats AlphaEvolve on the Erdős minimum-overlap problem, beats hand-tuned GPU kernel champions on four chip generations, outscores top humans on AtCoder, and matches bespoke methods on single-cell RNA denoising — all with one method costing a few hundred dollars per problem.
本文梳理从"LLM 作聊天助手"到"LLM 作科学发现引擎"路径上的三个里程碑:FunSearch(DeepMind 2023)在 cap set 与 bin packing 问题上跑通了基础循环,AlphaEvolve(DeepMind 2025)把它扩展到 Google 生产规模,TTT-DiscoverYuksekgonul, Koceja, Sun 等,2026)加上缺失的一环——在推理时训练 LLM。后者在 Erdős minimum-overlap 问题上超过 AlphaEvolve,在四代芯片上的 GPU kernel 竞赛中击败手工调优冠军,在 AtCoder 上超过顶尖人类选手,并在单细胞 RNA 去噪上匹敌定制方法——所有结果来自同一方法,每问题成本几百美元。

A Warm-Up: The Cap Set Problem

Before we look at any method, let’s spend a few minutes on one concrete problem all three systems will try to solve — the cap set problem. We’ll work up to the formal statement slowly, starting from a card game you may have played. If you already know the problem from additive combinatorics, you can skim this section.

在看任何方法前,我们花几分钟介绍一个具体问题——三个系统都会尝试解的 cap set 问题。从你可能玩过的一个卡牌游戏出发,慢慢推到正式陈述。如果你已经从加性组合学里熟悉这个问题,可以跳过本节。

Start With the Card Game SET

You may know the card game SET. Each card has four attributes — color, shape, number, shading — and each attribute takes one of three values (red/green/purple, oval/squiggle/diamond, 1/2/3, solid/striped/open). So the full deck has \(3 \times 3 \times 3 \times 3 = 81\) cards. A “SET” is three cards such that, for each of the four attributes, the three cards are either all the same or all different. Players race to spot SETs on a shared table.

Hidden inside the game is a natural puzzle:

The no-SET puzzle. Lay out a bunch of cards on the table. What’s the largest number of cards you can put down with no three of them forming a SET?

The answer is 20. You can lay 20 cards from the 81-card deck without any three forming a SET; you cannot lay 21. The constructions that achieve 20 are not obvious — they have specific structure. Our goal in the rest of this warm-up is to turn this innocent-looking card-game question into a mathematical problem you can actually attack.

你可能听说过卡牌游戏 SET。每张卡有四个属性——颜色形状数量填充——每个属性取三种取值之一(红/绿/紫、椭圆/波浪/菱形、1/2/3、实心/条纹/空心)。整副 deck 共 \(3 \times 3 \times 3 \times 3 = 81\) 张牌。一个”SET”是三张卡,使得 四个属性 上这三张卡 要么全相同、要么全不同。玩家在共享桌面上抢着找出 SET。

游戏里藏着一个自然的谜题:

无 SET 谜题。 在桌上摆一堆卡。在不让任何三张构成 SET 的前提下,桌上 最多 能放多少张?

答案是 20。81 张 deck 里你能摆 20 张而其中没有 SET,但摆不到 21 张。达到 20 张的构造并不显然,它们有特定结构。本节余下部分的目标是把这个看似无害的卡牌游戏问题翻译成一个你可以真正攻击的数学问题。

Encoding the SET Rule as Math

To reason about the no-SET puzzle we leave the cards behind and work with numbers. Label each attribute value as 0, 1, or 2. (For color, say red = 0, green = 1, purple = 2. The choice is arbitrary; what matters is that each attribute has three labels.) A card now becomes a 4-tuple of digits, e.g., \((0, 2, 1, 1)\) = “red, diamond, 2 shapes, striped”. The 81-card deck is exactly all possible such tuples. Mathematicians write this collection as \(\mathbb{F}_3^4\) — but you can read the symbol as nothing more than “4-tuples whose entries are 0, 1, or 2.”

Now look at when three cards form a SET. Pick any one attribute (one coordinate) and consider the three cards’ values there. By the SET rule, the three values are either all the same or all different. Sum the three values in each case:

  • All same, e.g. \(1 + 1 + 1 = 3\). Divisible by 3.
  • All different — three values \(0, 1, 2\) in some order — \(0 + 1 + 2 = 3\). Divisible by 3.
  • Two same one different (a non-SET case), e.g. \(0 + 0 + 1 = 1\). Not divisible by 3.

So the SET rule, attribute by attribute, is exactly the rule “three values sum to a multiple of 3.” Stacking the four attributes back together:

Three cards \(x, y, z\) form a SET \(\iff x + y + z = (0,0,0,0)\) in \(\mathbb{F}_3^4\), where addition is taken coordinate-wise mod 3.

This is the small but important translation step. The puzzle “what’s the largest no-SET collection of cards?” has become a clean mathematical question:

What’s the largest subset of \(\mathbb{F}_3^4\) in which no three distinct elements sum to zero (mod 3)?

要对无 SET 谜题做推理,我们丢掉卡牌、用数字工作。把每个属性的取值编号为 0、1、2。(颜色上可设红 = 0、绿 = 1、紫 = 2。这种选择是任意的;要紧的是每个属性有 三个 标签。)一张卡现在是一个 4-元组数字,如 \((0, 2, 1, 1)\) = “红、菱形、2 个形状、条纹”。81 张 deck 恰是所有这种元组的集合,记作 \(\mathbb{F}_3^4\)——你完全可以把这个符号读作”四项元组,每项是 0、1 或 2”。

现在看三张卡何时构成 SET。取任一属性(一个坐标),看三张卡在该坐标的值。按 SET 规则,三值要么全相同、要么全不同。把三个值相加:

  • 全相同,如 \(1 + 1 + 1 = 3\)。被 3 整除。
  • 全不同——三值是 \(0, 1, 2\) 的某种排列——\(0 + 1 + 2 = 3\)。被 3 整除。
  • 两同一异(非 SET 情形),如 \(0 + 0 + 1 = 1\)。不被 3 整除。

所以 SET 规则逐属性来看 恰好 是”三值之和为 3 的倍数”。把四个属性叠起来:

三张卡 \(x, y, z\) 构成 SET \(\iff x + y + z = (0,0,0,0)\) 在 \(\mathbb{F}_3^4\) 中,加法逐坐标模 3。

这是小但关键的翻译。谜题”最大无 SET 卡牌集合多大”变成了一个干净的数学问题:

\(\mathbb{F}_3^4\) 的最大子集多大,使其中任意三个不同元素之和不为零(mod 3)?

Cap Sets in F3n, and Why They're Hard

There’s nothing magic about four attributes. Replace 4 with any positive integer \(n\) and you get a whole family of problems — the cap set problem:

Given \(\mathbb{F}_3^n\) (the set of \(n\)-tuples with entries in \(\{0, 1, 2\}\)), find the largest subset such that no three distinct elements sum to zero coordinate-wise mod 3. A subset with this property is called a cap set in \(\mathbb{F}_3^n\).

Small cases can be computed (or, for \(n \leq 6\), even proved exactly):

\(n\) 1 2 3 4 (SET deck) 5 6
Max cap set size 2 4 9 20 45 112

For \(n \geq 7\) the exact answer is unknown; only upper and lower bounds exist. For \(n = 8\) the largest known explicit construction sits at 496 — and improving that by even one element is genuine research. FunSearch produced a construction of size 512 for \(n = 8\), which is exactly such a research-level advance.

Why is the problem so hard? The space \(\mathbb{F}_3^n\) has \(3^n\) elements, so it has \(2^{3^n}\) subsets. For \(n = 8\) that’s \(2^{6561}\), a number with roughly 1975 decimal digits. Brute-force search over subsets is hopeless. You need a construction — a clever recipe that builds a large subset while systematically avoiding any three elements that sum to zero. Such constructions are rare and typically come from deep structural insights: group theory, error-correcting codes, additive number theory.

The slow progress is telling. Between 2002 and 2023, the largest known cap set in \(\mathbb{F}_3^8\) was improved by humans only a handful of times. FunSearch’s 2023 contribution is in this slow line.

To get a visceral sense for how the constraint bites, try the simplest non-trivial case yourself. The widget below is a 3×3 grid of \(\mathbb{F}_3^2\). Click points to add them to your cap set; any three selected points whose coordinates sum to \((0, 0)\) mod 3 will be flagged with a red dashed triangle. The maximum cap set in \(\mathbb{F}_3^2\) has size 4 — try to find one. The surprise is usually how few points you need before some “wrap-around” triple snaps shut.

4 个属性没什么特别。把 4 换成任意正整数 \(n\),就得到一整族问题——cap set 问题

给定 \(\mathbb{F}_3^n\)(所有 \(n\)-元组,每项取自 \(\{0, 1, 2\}\))。找出 最大 的子集,使其中任意三个不同元素之和不为零(逐坐标 mod 3)。具有此性质的子集称为 \(\mathbb{F}_3^n\) 中的 cap set

小情形可以算出(\(n \leq 6\) 甚至可以精确证明):

\(n\) 1 2 3 4(SET 卡牌) 5 6
最大 cap set 大小 2 4 9 20 45 112

\(n \geq 7\) 时精确答案 未知,只有上下界。\(n = 8\) 时已知最大显式构造为 496——把它扩大哪怕一个元素都是真正的研究。FunSearch 在 \(n = 8\) 上给出了大小 512 的构造,正是这种研究级进展。

为什么这个问题这么难?空间 \(\mathbb{F}_3^n\) 有 \(3^n\) 个元素,所以子集总数是 \(2^{3^n}\)。\(n = 8\) 时是 \(2^{6561}\),大约 1975 位十进制数。在子集上穷举搜索没指望。你需要 构造 ——一种巧妙的配方,系统性地避开所有三元素之和为零的情形,同时还足够大。这种构造稀少,通常源自深层结构洞察:群论、纠错码、加性数论。

进展的缓慢说明问题。2002 到 2023 年间,\(\mathbb{F}_3^8\) 中已知最大 cap set 仅被人类改进过寥寥几次。FunSearch 2023 年的贡献就在这条缓慢之线上。

为了对这种约束如何咬住你建立 直观感受,自己玩玩最简单的非平凡情形。下面这个小游戏是 \(\mathbb{F}_3^2\) 的 3×3 网格。点击点把它们加入你的 cap set;任何三个已选点的坐标之和为 \((0, 0)\) mod 3 都会被标成红色虚线三角形。\(\mathbb{F}_3^2\) 的最大 cap set 大小是 4——试着找出一个。意外往往在于:选不了 几个 点,就有某条”环绕”三元组突然合上。

Why This Is the Right Yardstick

The cap set problem has three properties that make it nearly ideal as a benchmark for any LLM-driven discovery system:

  1. Mechanically verifiable. A few lines of Python check whether a given subset is a cap set: iterate over triples, check if any sums to zero mod 3. There is no judgment call, no need for a human or another LLM to grade the output.
  2. No closed form. There is no formula to derive the answer from. Every improvement is genuinely new mathematics, not a rederivation.
  3. Connected to deep mathematics. Cap-set sizes control the asymptotic exponent \(\omega\) of matrix multiplication via the Cohn–Umans framework. The problem is also the simplest non-trivial instance of “find dense subsets that contain no three-term arithmetic progressions” — a foundational question in additive combinatorics related to Roth’s theorem.

These three properties — easy to verify, no closed form, connected to deeper structure — are exactly what we’ll want from every benchmark in this post. FunSearch picked the cap set problem as one of its two flagship demonstrations because of them. We’ll see analogues in AlphaEvolve (the Erdős minimum-overlap problem) and TTT-Discover (the same Erdős problem, plus the autocorrelation inequality, GPU kernels, AtCoder, biology). The shape of the problem matters as much as the method.

With this concrete problem in mind, we can now look at how FunSearch, AlphaEvolve, and TTT-Discover try to solve it.

cap set 问题有三个性质,使它几乎是 LLM 驱动发现系统的理想 benchmark:

  1. 机械可验证。 几行 Python 就能检查给定子集是否是 cap set:遍历三元组、检查是否有和为零 mod 3。无需主观判断,无需人类或另一个 LLM 给输出打分。
  2. 无闭式。 没有公式可推导答案。每次改进都是真正的新数学,而非重推。
  3. 与深刻数学相连。 Cap set 的大小通过 Cohn–Umans 框架控制矩阵乘法的渐进指数 \(\omega\)。该问题也是”找出不含三项算术级数的稠密子集”这一类问题中最简单的非平凡实例——与 Roth 定理相关的加性组合学基础问题。

这三条性质——易验证、无闭式、连接更深结构——正是本文每个 benchmark 都会具备的。FunSearch 选 cap set 作为两个旗舰结果之一正因如此。我们会在 AlphaEvolve(Erdős minimum-overlap 问题)与 TTT-Discover(同一 Erdős 问题,加自相关不等式、GPU kernel、AtCoder、生物学)中看到类似形态。问题的形态和方法本身一样重要。

带着这个具体问题,我们现在可以看 FunSearch、AlphaEvolve、TTT-Discover 如何尝试解它。

FunSearch (2023): Searching the Space of Programs

The path from “LLMs as conversational assistants” to “LLMs as discovery engines” was opened by FunSearch (Romera-Paredes et al., Nature, Dec 2023). It established the template every subsequent system — including TTT-Discover — has built on: a frozen pretrained LLM proposes candidate programs, a deterministic evaluator scores them, an evolutionary outer loop selects survivors and stuffs them back into the next prompt. Understanding what FunSearch did, and where its design choices start to bind, is the cleanest entry point to the rest of this post.

从”LLM 作聊天助手”走向”LLM 作发现引擎”的路径由 FunSearch(Romera-Paredes 等,Nature,2023 年 12 月)开启。它确立了之后所有系统——包括 TTT-Discover——都沿用的模板:冻结的预训练 LLM 提出候选程序,确定性 evaluator 打分,外层演化循环挑选幸存者塞回下一次 prompt。理解 FunSearch 做了什么、它的设计选择从哪里开始受限,是进入本文其余部分最干净的入口。

The Idea: Search Over Programs, Not Outputs

The non-obvious move in FunSearch is that the LLM is asked to produce programs that solve the problem, not answers to the problem. For a problem like “find the largest cap set in \(\mathbb{F}_3^n\),” there are at least three plausible search targets:

  1. The answer directly — a literal set of vectors. Hard to mutate locally, hard to characterize diversity over, and the search space scales with the answer size.
  2. A solver — a program that, given \(n\), returns a cap set. Compact, interpretable, runs in milliseconds, gives a clean scalar score.
  3. An informal proof or argument — what GPT-style chain-of-thought defaults to. Hard to verify automatically; needs an LLM-as-judge that itself becomes a noise source.

Why is a program verifiable when chain-of-thought isn’t? Both produce a final answer. The asymmetry is in how you check the answer, not in whether one exists. FunSearch’s deliverable is code: the evaluator runs it and the constructed object (a size-512 cap set, a packing of \(10^6\) items) is generated at evaluation time, then checked mechanically — a single assert confirms “is this a cap set?”. Chain-of-thought’s deliverable is prose, which leaves two bad options: either embed the answer literally in the text (a 512-vector list rarely survives an LLM completion intact, and parsing it back out lands you in option 1 with all its scale problems), or argue informally toward it (now verification needs to follow reasoning, which a deterministic checker cannot do — you fall back to another LLM-as-judge or a strict proof checker like Lean, the narrow path AlphaProof takes). Running code is one line of evaluator; reading prose needs parsing, trust, or both.

FunSearch picks (2). Three properties follow:

  • Compactness. A 20-line Python function can generate arbitrarily large outputs (e.g., a cap set for any \(n\)). Direct-output search over strings of size proportional to the answer cannot.
  • Verifiability. Run the function and compute the score. No subjective judgment, no LLM-as-judge loop, no proof checker required.
  • Interpretability. A mathematician can read the resulting Python and understand what construction was discovered — sometimes well enough to generalize it into a paper-quality proof.

The “Fun” in FunSearch is function: searching in the space of functions. That naming choice telegraphs the central design commitment.

FunSearch 不显然的一步是:让 LLM 生成 解决问题的程序,而不是问题的 答案。对于”在 \(\mathbb{F}_3^n\) 中找最大 cap set”这类问题,至少有三种合理的搜索目标:

  1. 直接搜索答案 ——一组具体向量。难以局部变异,难以衡量多样性,搜索空间随答案规模膨胀。
  2. 搜索求解器 ——给定 \(n\) 返回 cap set 的程序。紧凑、可解释、毫秒级运行、得分干净。
  3. 搜索非形式证明或论证 ——GPT 风格链式思维默认的输出。难以自动验证;需要 LLM-as-judge,而它本身又是噪声源。

同样是给出最终答案,为什么程序可验证、CoT 不行? 差异在 如何检查 答案,而不是是否存在答案。FunSearch 的产物是 代码:evaluator 运行它,构造对象(大小 512 的 cap set、\(10^6\) 物品的装箱)在评估时生成,然后机械检查——一行 assert 就能确认”这是一个 cap set 吗”。链式思维的产物是 prose,只剩两条糟糕路径:要么把答案字面嵌入文本(512 元向量列表极少能在 LLM 一次完成中完整存活,且 parse 回来又退回第 1 种选项,撞上同样的规模问题),要么非形式地论证到答案(验证需要跟随 推理,确定性检查器做不到——只能退回另一个 LLM-as-judge,或严格证明检查器如 Lean,这正是 AlphaProof 走的窄路)。运行代码是一行 evaluator;阅读 prose 需要 parsing、信任,或两者皆需。

FunSearch 选 (2)。由此带来三个性质:

  • 紧凑性。 一个 20 行 Python 函数可生成任意大的输出(如任意 \(n\) 的 cap set)。直接在与答案规模成比例的字符串上搜索做不到这点。
  • 可验证性。 运行函数并计算得分。无主观判断,无 LLM-as-judge 循环,无需证明检查器。
  • 可解释性。 数学家可以读懂生成的 Python,理解发现了什么构造——有时甚至足以泛化为论文级证明。

FunSearch 中的 “Fun” 即 function:在函数空间中搜索。这个命名直白地宣示了它的核心设计承诺。

Architecture: Frozen LLM + Evolutionary Loop + Sandboxed Evaluator

FunSearch’s loop has four components:

  1. Pretrained LLM. DeepMind used Codey, a PaLM 2 variant fine-tuned for code, frozen throughout. Its only role is to take a prompt of existing programs and write a new one.
  2. Programs database. An island-based evolutionary population. Each “island” maintains its own pool of high-scoring programs; periodic inter-island migration prevents premature convergence.
  3. Prompt builder. Selects \(k\) programs from the database (biased toward high scorers), formats them as in-context examples, and asks the LLM to write a better one.
  4. Sandboxed evaluator. Runs the candidate on the target instance, returns a scalar score. Evaluation is fast (milliseconds) and deterministic.

Crucial constraint: FunSearch evolves one function of at most around 20 lines — the “priority function” or “heuristic” that embodies the central decision the search is optimizing. Everything around it (main loop, I/O, scoring) is fixed by the human designer. This narrow target keeps per-evaluation cost tiny and lets the loop process millions of candidates in one experiment. AlphaEvolve and TTT-Discover both relax this constraint (entire files vs. single functions), and the cost structure changes accordingly.

The frozenness of the LLM is the other binding constraint, and one we will keep coming back to. FunSearch’s pretrained Codey weights are identical at iteration 1 and iteration \(10^6\). Whatever skill the system shows is amortized over all the problems Codey was ever pretrained on, plus the in-context examples assembled at runtime.

FunSearch 的循环有四个组件:

  1. 预训练 LLM。 DeepMind 用 Codey,一个为代码微调的 PaLM 2 变体,全程 冻结。它的唯一职责是接收现有程序的 prompt 并写一个新的。
  2. 程序数据库。 Island-based 演化种群。每个”岛”维护自己的高得分程序池;岛间偶尔迁移防止过早收敛。
  3. Prompt 构建器。 从数据库中选 \(k\) 个程序(偏向高得分者),格式化为 in-context 示例,让 LLM 写一个更好的。
  4. 沙盒 evaluator。 在目标实例上运行候选并返回标量得分。评估快(毫秒级)且确定。

关键约束:FunSearch 演化的是 一个函数,最多约 20 行——”priority function”或”heuristic”承载搜索要优化的核心决策。它周围的一切(主循环、IO、打分)由人类设计者固定。这个狭窄目标让单次评估代价极小,使循环能在一次实验中处理数百万候选。AlphaEvolve 与 TTT-Discover 都放宽了这一约束(整文件 vs 单函数),相应地成本结构也变了。

LLM 的冻结是另一个绑死的约束,本文会反复回到这一点。FunSearch 的预训练 Codey 权重在第 1 次迭代与第 \(10^6\) 次迭代完全相同。系统展示出的所有能力,都摊在 Codey 预训练过的所有问题加上运行时拼装的 in-context 示例之上。

Headline Results: Cap Sets and Online Bin Packing

What's online bin packing? (Click to expand)

Online bin packing: items of various sizes arrive one at a time and must be placed immediately, without knowledge of future arrivals, into bins of fixed capacity. Goal: minimize the total number of bins used. The problem is a classical operations-research staple, modeling settings from VM placement in data centers to truck loading.

Standard heuristics include First-Fit (place in the first bin with room), Best-Fit (place in the tightest bin with room), and First-Fit-Decreasing (offline; sort first). Their competitive ratios are well-studied; improving on them under realistic distributions is non-trivial because the "obvious" tweaks tend to break worst-case guarantees.

FunSearch evolved a small priority function that beats First-Fit and Best-Fit by meaningful margins on standard benchmark distributions, and is readable enough that the underlying "waste-aware" decision rule can be explained in a single sentence.

Two flagship results.

Cap set. For multiple values of \(n\), FunSearch produced larger explicit cap sets in \(\mathbb{F}_3^n\) than any previously known construction. For \(n = 8\) it produced a cap set of size 512. This is a real advance: cap-set lower bounds had been a slow-moving target in additive combinatorics, with progress measured in rare exact constructions, and FunSearch contributed one. The discovered Python function was short enough that mathematicians could read it, simplify it, and study what symmetry it was exploiting.

Online bin packing. FunSearch evolved a priority function that outperforms First-Fit and Best-Fit on multiple benchmark distributions, with a decision rule readable enough to deploy. This was the first demonstration that an LLM-driven search could discover practical algorithmic heuristics whose performance is verified by running on real workloads — not merely solve an artificial puzzle.

The two results matter for different reasons. Cap sets show the system can advance a research-level open problem. Bin packing shows it can produce practical artifacts. Both are made possible by a frozen LLM — there is no learning happening, only sampling from a fixed prior plus selection. The size of that prior and the cleverness of the prompt builder do the work. This is the design point AlphaEvolve and then TTT-Discover would push against in opposite directions.

什么是在线装箱(online bin packing)?(点击展开)

在线装箱:不同大小的物品逐一到达,必须立即放入容量固定的箱子,不知道未来到达。目标:最小化箱子总数。该问题是运筹学经典问题,建模从数据中心 VM 放置到卡车装载等场景。

标准启发式包括 First-Fit(放入第一个能装下的箱子)、Best-Fit(放入最紧的能装下的箱子)、First-Fit-Decreasing(离线;先排序)。它们的竞争比研究充分;在现实分布上改进它们并不容易,因为"显然"的微调往往破坏最坏情况保证。

FunSearch 演化出一个小型 priority function,在标准 benchmark 分布上以显著幅度击败 First-Fit 与 Best-Fit——而且可读性足以让其背后的"waste-aware"决策规则用一句话解释。

两个旗舰结果。

Cap set。 对多个 \(n\),FunSearch 在 \(\mathbb{F}_3^n\) 中产生了比此前任何构造都大的显式 cap set。\(n = 8\) 的结果是大小 512 的 cap set。这是 真实 的进展:加性组合学的 cap-set 下界一直是缓慢移动的目标,进展用稀有的精确构造衡量,FunSearch 贡献了一个。发现的 Python 函数足够短,数学家可以阅读、简化它,并研究它在利用什么对称性。

在线装箱。 FunSearch 演化出的 priority function 在多个 benchmark 分布上超过 First-Fit 与 Best-Fit,决策规则可读且可被部署。这是首次证明 LLM 驱动的搜索能发现 实用 的算法启发式,其性能可在真实负载上验证——不仅是解一个人造谜题。

两个结果意义不同。Cap set 显示系统能推进研究级开放问题。装箱显示它能产出实用工件。两者都建立在 冻结 LLM 之上——没有学习发生,只有从固定先验采样加选择。先验的体量与 prompt 构建器的巧妙做了所有工作。这正是 AlphaEvolve 与之后的 TTT-Discover 从相反方向去推动的设计点。

Why FunSearch Works: Verify Is Easy, Construct Is Hard

Two complementary properties make cap set the right problem for an LLM-evolutionary loop. Together they let a noisy proposer beat decades of careful human work.

Verification is cheap and exact. Given any subset \(S \subseteq \mathbb{F}_3^n\), checking whether \(S\) is a cap set takes a few lines of Python:

import itertools

def is_cap_set(S, n):
    """Return True iff no 3 elements of S sum to (0,...,0) mod 3."""
    for a, b, c in itertools.combinations(S, 3):
        if all((a[k] + b[k] + c[k]) % 3 == 0 for k in range(n)):
            return False
    return True

That’s it. No subjective judgment, no LLM-as-judge, no learned verifier. The check runs in \(O(\vert S\vert^3 \cdot n)\) time — polynomial in the size of the answer, regardless of how clever the construction was. The evaluator runs this in milliseconds on a candidate and emits a clean numerical score (typically: \(\vert S \vert\) if valid, else 0).

Construction is exponentially hard. \(\mathbb{F}_3^n\) has \(3^n\) elements and \(2^{3^n}\) subsets. For \(n = 8\) that is \(2^{6561}\) — enumeration is hopeless. Worse, naive greedy construction (“add the next compatible element”) plateaus quickly: the largest cap sets require non-local choices, where committing to a slightly worse early element opens up a much larger valid extension later. Humans have produced cap-set lower bounds slowly over decades by inventing constructions from algebraic geometry, group-theoretic symmetries, and error-correcting codes — each requiring years of domain expertise to find a single new construction worth publishing.

This is the canonical NP-style asymmetry: a valid construction is poly-time checkable, but finding one (let alone an optimal one) is exponential in the worst case. Most interesting scientific-discovery problems share this shape — theorems are easy to verify in Lean, programs are easy to run, kernels are easy to benchmark, but inventing the artifact is hard.

What FunSearch outsources. Crucially, FunSearch does not enumerate cap sets or try to “understand” them mathematically. It evolves a small priority(el) function that scores individual elements of \(\mathbb{F}_3^n\). A fixed greedy driver then builds the cap set by taking elements in priority order, rejecting any that would form a line with two already-chosen elements:

import itertools

# === The ~20-line function FunSearch EVOLVES. ===
# The published priority function for n=8 sums several modular-arithmetic
# terms; the form below sketches its structure. See Romera-Paredes et al.
# 2023, Extended Data Fig. 3, for the exact code.
def priority(el: tuple[int, ...], n: int) -> float:
    score = 0.0
    for i in range(n):
        score += LINEAR_COEFF[i][el[i]]                      # per-coordinate
    for i in range(n):
        for j in range(i + 1, n):
            score += PAIR_COEFF[i][j][el[i]][el[j]]          # pairwise
    # ... a few more terms; the coefficients embed the discovered structure
    return score

# === The FIXED driver. FunSearch does NOT evolve this. ===
def build_cap_set(n: int) -> list[tuple[int, ...]]:
    elements = list(itertools.product(range(3), repeat=n))
    elements.sort(key=lambda e: -priority(e, n))             # highest first
    chosen: list[tuple[int, ...]] = []
    for e in elements:
        # Reject if e completes a line with any pair already chosen.
        forms_line = any(
            all((a[k] + b[k] + e[k]) % 3 == 0 for k in range(n))
            for a in chosen for b in chosen if a < b
        )
        if not forms_line:
            chosen.append(e)
    return chosen

All the intellectual content lives in priority. The LLM proposes candidate priority functions; the evaluator runs build_cap_set and reports len(chosen); the evolutionary loop keeps the high-scoring ones and asks the LLM for variations. Over millions of evaluations, the search converges on priority functions whose induced greedy ordering produces unusually large cap sets — for \(n=8\), FunSearch’s discovered priority function produces a cap set of size 512.

Why this division of labor matters. Humans hit a cognitive wall when asked “how should I score an element to bias the greedy toward large constructions?” The question has millions of plausible answers, no closed form for the best one, and no obvious symmetry to exploit. A mathematician trying to write the priority function from scratch would think in terms of group orbits, linear codes, or quadratic forms — but the best priority function FunSearch finds may not match any of those: it’s a list of coefficients tuned through millions of greedy runs, each scored objectively. The LLM does not think mathematically; it samples short Python functions, each one a guess. Most are bad. A few are surprisingly good. The evaluator separates them cheaply.

The core asymmetry. Cheap to test, intractable to design. This is what lets a noisy LLM proposer plus a deterministic verifier discover what no human could write directly. AlphaEvolve and TTT-Discover both inherit this same asymmetry — on bigger problems (full code files, GPU kernels, biology pipelines) and with bigger budgets, but the underlying logic is identical.

两个互补的性质让 cap set 成为 LLM-演化循环的理想问题。它们结合起来,让一个嘈杂的提议器击败了几十年的人类细致工作。

验证既便宜又精确。 给定任何子集 \(S \subseteq \mathbb{F}_3^n\),检查它是否是 cap set 只需几行 Python:

import itertools

def is_cap_set(S, n):
    """S 是 cap set 当且仅当 S 中任 3 元素之和不为 (0,...,0) mod 3。"""
    for a, b, c in itertools.combinations(S, 3):
        if all((a[k] + b[k] + c[k]) % 3 == 0 for k in range(n)):
            return False
    return True

完事。无主观判断、无 LLM-as-judge、无学习的验证器。检查在 \(O(\vert S\vert^3 \cdot n)\) 时间内完成——对答案规模 多项式,与构造的巧妙程度无关。Evaluator 在候选上毫秒级运行,吐出干净的数值分数(通常:合法则 \(\vert S \vert\),否则 0)。

构造指数难。 \(\mathbb{F}_3^n\) 有 \(3^n\) 个元素,\(2^{3^n}\) 个子集。\(n = 8\) 时是 \(2^{6561}\)——枚举无望。更糟,朴素 贪心(加上下一个相容元素)很快停滞:最大 cap set 需要 非局部 选择——早期承诺一个稍差的元素,可能在之后开出大得多的有效扩展。人类几十年来通过发明来自代数几何、群论对称性、纠错码的构造缓慢推进 cap-set 下界——每个新构造都要数年领域积累。

这是经典的 NP 式不对称:一个有效构造可以多项式时间 检验,但 找到 一个(更别说最优的)最坏情况下是指数的。大多数有趣的科学发现问题都有这种形态——定理在 Lean 中易于检查、程序易于运行、kernel 易于 benchmark,但发明这个工件本身难。

FunSearch 外包给谁。 关键的是,FunSearch 枚举 cap set,也不试图在数学意义上”理解”它们。它演化一个小的 priority(el) 函数,给 \(\mathbb{F}_3^n\) 的单个元素打分。一个 固定的 贪心驱动器按优先级顺序取元素,跳过任何会与两个已选元素共线的元素:

import itertools

# === FunSearch 演化的、约 20 行的函数。===
# 发表的 n=8 priority 函数由若干模运算项相加;下面的形式给出其结构骨架。
# 准确代码见 Romera-Paredes 等 2023 年论文 Extended Data Fig. 3。
def priority(el: tuple[int, ...], n: int) -> float:
    score = 0.0
    for i in range(n):
        score += LINEAR_COEFF[i][el[i]]                      # 单坐标项
    for i in range(n):
        for j in range(i + 1, n):
            score += PAIR_COEFF[i][j][el[i]][el[j]]          # 成对交互项
    # ... 还有几项;这些系数承载了搜索发现的结构
    return score

# === 固定的驱动器。FunSearch *不* 演化这部分。===
def build_cap_set(n: int) -> list[tuple[int, ...]]:
    elements = list(itertools.product(range(3), repeat=n))
    elements.sort(key=lambda e: -priority(e, n))             # 优先级高的先
    chosen: list[tuple[int, ...]] = []
    for e in elements:
        # 若 e 与已选两元素能合成一条线,则拒绝。
        forms_line = any(
            all((a[k] + b[k] + e[k]) % 3 == 0 for k in range(n))
            for a in chosen for b in chosen if a < b
        )
        if not forms_line:
            chosen.append(e)
    return chosen

所有智力内容都在 priority 里。LLM 提出候选 priority 函数;evaluator 运行 build_cap_set 报告 len(chosen);演化循环保留高分的,让 LLM 给出变体。数百万次评估后,搜索收敛到能让贪心排序产生异常大 cap set 的 priority 函数——对 \(n=8\),FunSearch 发现的 priority 函数产生大小 512 的 cap set。

这个分工为什么关键。 人类被问到 “我应该怎样给元素打分来偏向贪心,让它构造出大的 cap set?” 时撞到认知墙。问题有数百万种貌似合理的答案、最佳答案无闭式、无明显对称性可用。试图 从零写 priority 函数的数学家会从群轨道、线性码、二次型思考——但 FunSearch 找到的 最佳 priority 函数可能不匹配其中任何一种:它是一串系数,通过数百万次贪心运行客观打分调出来的。LLM 不像数学家那样推理;它采样短 Python 函数,每个都是一次猜测。大多数差。少数惊喜地好。Evaluator 廉价地把它们分开。

核心不对称。 测试便宜,设计难解。这就是嘈杂的 LLM 提议器加确定性验证器能发现人类无法直接写出的东西的根本原因。AlphaEvolve 与 TTT-Discover 都继承同一不对称——在更大问题(整文件、GPU kernel、生物学管线)上用更大预算,但底层逻辑相同。

AlphaEvolve (2025): Scaling the Template

AlphaEvolve (DeepMind, 2025) inherits FunSearch’s shape and scales every axis of it: bigger models, longer programs, richer evaluator feedback, and the operational discipline to run all of this against problems that are economically meaningful at Google’s scale. The combination unlocks results that FunSearch’s constraints made out of reach.

AlphaEvolve(DeepMind,2025)继承 FunSearch 的形状并在每个维度上扩展:更大的模型、更长的程序、更丰富的 evaluator 反馈、以及让这一切跑在 Google 规模上经济上有意义的问题上的工程纪律。这一组合解锁了 FunSearch 的约束所触不到的结果。

What Changed: Functions to Files, One Model to Two

Three architectural choices distinguish AlphaEvolve from FunSearch:

Two-tier LLM ensemble. A single run uses both Gemini 2.0 Flash (fast, cheap, generates the bulk of proposals) and Gemini 2.5 Pro (slower, smarter, used occasionally to seed harder candidates). Pure 2.5 Pro would be too expensive to run at evolutionary throughput; pure 2.0 Flash misses the deeper moves. The ensemble gets most of the throughput at Flash cost while keeping Pro’s creativity available where it counts.

Evolution at code-file granularity. Where FunSearch evolved single ≤20-line Python functions, AlphaEvolve evolves entire code files of hundreds of lines in any language. Candidates can carry non-trivial state, helper functions, even small frameworks. The trade-off is that each evaluation may take hours rather than milliseconds — but for high-value problems (kernel speedups, scheduler heuristics), that’s still worth it.

Parallel evaluators with rich feedback. Each candidate is dispatched to a parallel evaluator pool. The evaluator returns not just a scalar score but per-objective signals the next prompt can reference. The outer loop is still island-based evolution inherited from FunSearch, but populations and migration patterns are scaled up.

The cost asymmetry vs. FunSearch is the obvious downside: AlphaEvolve’s per-evaluation cost is orders of magnitude higher, so total candidates explored is orders of magnitude lower. The bet is that the richer search space (full files) and richer feedback (per-objective signals) more than compensate.

三个架构选择把 AlphaEvolve 与 FunSearch 区分开:

双层 LLM 集成。 单次运行同时使用 Gemini 2.0 Flash(快、便宜,生成大部分提议)与 Gemini 2.5 Pro(慢、聪明,偶尔种入更难的候选)。纯 2.5 Pro 太贵无法支撑演化吞吐量;纯 2.0 Flash 又会错过较深的招数。集成让系统以 Flash 成本获得大部分吞吐量,同时在关键处仍可调用 Pro 的创造力。

代码文件粒度的演化。 FunSearch 演化的是单个 ≤20 行的 Python 函数,AlphaEvolve 演化的是任何语言的整份代码文件(数百行)。候选可以携带非平凡状态、辅助函数、甚至小型框架。代价是每次评估可能耗时数小时而非毫秒——但对高价值问题(kernel 加速、调度启发式),仍然值得。

并行 evaluator 与丰富反馈。 每个候选被分发到并行 evaluator 池。Evaluator 不仅返回标量分数,还返回逐目标信号供下次 prompt 引用。外层循环仍是继承自 FunSearch 的 island-based 演化,但种群与迁移模式被放大。

相比 FunSearch 的明显代价是:AlphaEvolve 的单次评估成本高几个数量级,故总探索候选数也低几个数量级。赌注是更丰富的搜索空间(整文件)与更丰富的反馈(逐目标信号)能补偿之。

Benchmark Results: Math and Infrastructure

What does this machinery actually produce? The headline results across math and infrastructure:

Domain Problem Prior best AlphaEvolve Direction
Algorithm discovery 4×4 complex matrix mult 49 mults (Strassen 1969) 48 mults Fewer is better
Extremal combinatorics Erdős minimum-overlap 0.380927 (2016) 0.380924 Lower is better
Discrete geometry Kissing number / packing Various human records Improved in several settings Better bound
ML systems Gemini training kernel Hand-tuned baseline ~23% speedup Faster is better
Cluster scheduling Borg scheduling heuristic Existing production ~0.7% compute recovered Higher recovery
TPU hardware Verilog circuit Existing design Reduced gate count Fewer gates better
What's Strassen's algorithm, and why is 48 mults a big deal? (Click to expand)

Volker Strassen showed in 1969 that the product of two 2×2 matrices can be computed with just 7 scalar multiplications (and more additions), not the obvious 8. Recursing this insight via block decomposition on n×n matrices gives runtime O(nlog2 7) ≈ O(n2.807) — the first algorithm beating the cubic O(n3) baseline. Applied once more to 4×4 = (2×2)×(2×2) the recursion gives 7×7 = 49 scalar multiplications.

That 49-mult count for 4×4 over the complex numbers stood unchallenged for 56 years, until AlphaEvolve found 48 in 2025. Earlier in 2022, AlphaTensor had already found a 47-mult algorithm for the 4×4 case — but only over the field of two elements ℤ2, not over the reals or complex numbers, so it does not apply to standard linear algebra.

The math results matter as proofs of capability: extremal problems with crisp, verifiable answers, on which AlphaEvolve unambiguously beat the human-mathematician state of the art (e.g., a 56-year-old Strassen record). The production results matter as proofs of value: deployed at Google scale, returning real compute and silicon savings. Both create non-trivial benchmarks for any successor system to target — including TTT-Discover, which directly reuses the Erdős minimum-overlap problem to claim a head-to-head win.

这套机器实际产出什么?数学与基础设施上的标志性结果:

领域 问题 此前最佳 AlphaEvolve 方向
算法发现 4×4 复矩阵乘法 49 次乘法 (Strassen 1969) 48 次 越少越好
极值组合学 Erdős minimum-overlap 0.380927 (2016) 0.380924 越低越好
离散几何 吻接数 / 装箱界 多种人类记录 在多种设定中改进 更优界
ML 系统 Gemini 训练 kernel 手工调优基线 ~23% 加速 越快越好
集群调度 Borg 调度启发式 现有生产 回收约 0.7% 算力 回收越多越好
TPU 硬件 Verilog 电路 现有设计 减少门数 越少越好
什么是 Strassen 算法?48 次乘法为何意义重大?(点击展开)

Volker Strassen 在 1969 年证明:两个 2×2 矩阵的乘积可以用 7 次标量乘法完成(加法更多),而不是显然的 8 次。通过分块递归把这一洞察应用到 n×n 矩阵上,给出运行时 O(nlog2 7) ≈ O(n2.807)——首个击败 O(n3) 立方基线的算法。再递归一次得到 4×4 = (2×2)×(2×2),即 7×7 = 49 次标量乘法

4×4 复矩阵 49 次乘法的记录无人撼动了 56 年,直到 2025 年 AlphaEvolve 找到 48 次。更早 2022 年 AlphaTensor 已经为 4×4 找到 47 次乘法算法——但仅适用于二元域 ℤ2,不适用于实数或复数,所以无法用于标准线性代数。

数学结果是 能力 的证明:有清晰、可验证答案的极值问题,AlphaEvolve 在其上明确击败了人类数学家的最强水平(如打破 56 年的 Strassen 记录)。生产结果是 价值 的证明:部署在 Google 规模上,真实地省回了算力与硅。两者都是任何继承系统必须挑战的非平凡基准——包括 TTT-Discover,它直接沿用 Erdős minimum-overlap 问题来做正面对比。

The Frozen-Weight Ceiling

Through AlphaEvolve’s entire pipeline, the LLM itself does not learn. Its weights at problem 1000 are bitwise identical to its weights at problem 1, just as FunSearch’s Codey weights were. All problem-specific behavior comes from in-context conditioning: the prompt for proposal \(N+1\) contains carefully-curated examples from \(N\). This is enormously powerful — frontier LLMs have huge implicit priors that the evolutionary loop can amplify — but it is also a ceiling.

Once the context budget is exhausted, the model cannot get better at this problem; the only remaining knob is running the loop longer and hoping a sample lands. Empirically, AlphaEvolve runs show diminishing returns after some number of generations on hard problems: the best score stops improving and the wall-clock cost per marginal advance balloons. The remedy that AlphaEvolve does not attempt is to update the model. TTT-Discover’s bet is precisely that moving the LLM’s weights themselves — modestly, RL-style, on the test problem — clears that ceiling at acceptable cost.

在 AlphaEvolve 整个流水线中,LLM 自身不学习。它在问题 1000 处的权重与问题 1 处完全相同,正如 FunSearch 的 Codey 权重那样。所有问题特定的行为都来自 in-context 条件化:提议 \(N+1\) 的 prompt 包含来自 \(N\) 的精心筛选示例。这极其强大——前沿 LLM 有巨大的隐式先验,演化循环可以放大它们——但也是天花板。

一旦 context 预算耗尽,模型不能再 对这个问题变得更好;剩下的唯一旋钮是把循环跑得更久,赌某个样本落中。经验上,AlphaEvolve 在难题上若干代后显示边际递减:最佳得分停止提升,每次边际进步的 wall-clock 成本急剧上升。AlphaEvolve 没有 尝试的补救是更新模型。TTT-Discover 的赌注正是 直接挪动 LLM 自己的权重——RL 式,温和地,在测试问题上——能否以可承受的代价打破这个天花板。

A Second Warm-Up: GPU Kernel Design

TTT-Discover’s most operationally impressive results are on GPU kernels — beating hand-tuned human champions on four different chips. To appreciate why that’s a hard problem (and what TTT-Discover’s RL loop is actually searching over), we need a short tour of how GPU kernels become fast. The same warm-up-then-method pattern as the cap-set section: this part introduces the problem; the TTT-Discover section that follows shows the result.

TTT-Discover 在 GPU kernel 上的结果操作上最令人印象深刻——在四种不同芯片上击败手工调优的人类冠军。为了体会 为什么 这是个难题(以及 TTT-Discover 的 RL 循环到底在搜索什么),我们需要简短介绍一下 GPU kernel 是怎么变快的。和 cap set 那一节一样的”先热身再方法”模式:这一节介绍问题,接下来的 TTT-Discover 节展示结果。

What Is a GPU Kernel, and Why Is Making It Fast Hard?

A GPU kernel is a small program that runs in parallel across many threads on a GPU. A modern NVIDIA H100 has 132 streaming multiprocessors (SMs), each running thousands of threads concurrently — so a single kernel launch can have on the order of \(10^6\) thread instances in flight at once. The GPU schedules them across SMs automatically.

For matrix multiplication \(C = A \cdot B\) with \(C\) of size \(M \times N\), the simplest possible kernel — the naive kernel — assigns one thread per output element. Thread \((i, j)\) reads row \(i\) of \(A\), column \(j\) of \(B\), accumulates a \(K\)-long dot product, and writes \(C_{ij}\). Correctness is trivial; performance is terrible. Modern GPUs reach their headline TFLOPS only when the compute units are fed fast enough — and reading from HBM (high-bandwidth memory) is much slower per byte than computing on already-loaded data. The naive kernel re-reads each element of \(A\) and \(B\) many times, the SMs spend most of their cycles waiting on memory, and the achieved throughput is typically under 5% of peak.

The entire subfield of GPU kernel optimization is about closing this gap.

GPU kernel 是在 GPU 上跨大量线程并行运行的小程序。一块现代 NVIDIA H100 有 132 个 流式多处理器(SM),每个 SM 同时运行数千线程——所以一次 kernel 启动可以同时并发约 \(10^6\) 个线程实例。GPU 自动把它们调度到各 SM 上。

对矩阵乘法 \(C = A \cdot B\)(\(C\) 大小为 \(M \times N\)),最简单的 kernel —— naive kernel ——给每个输出元素分配一个线程。线程 \((i, j)\) 读 \(A\) 的第 \(i\) 行、\(B\) 的第 \(j\) 列,累加 \(K\) 长的点积,写回 \(C_{ij}\)。正确性显然,性能则惨不忍睹。现代 GPU 只有当计算单元 喂得够快 才能达到标榜的 TFLOPS——而从 HBM(高带宽内存)读字节比对已加载数据做计算 慢得多。Naive kernel 每个 \(A\)、\(B\) 元素读很多次,SM 大部分周期在等内存,实际吞吐通常不到峰值的 5%。

整个 GPU kernel 优化的子领域就是为了关掉这个差距。

The Fundamental Tension: Compute vs Memory

Every kernel sits somewhere on a roofline:

  • Memory-bound regime. Throughput is limited by how fast data arrives from HBM. To go faster, reduce bytes-moved per useful op.
  • Compute-bound regime. Throughput is limited by how many FLOPs the SMs can do per cycle. To go faster, increase parallelism or use higher-throughput math units (tensor cores).

The dividing line is the kernel’s arithmetic intensity: FLOPs done per byte loaded. The crossover sits at \((\text{peak FLOPS}) \,/\, (\text{peak bandwidth})\). For an H100 with ~50 TFLOPS fp32 and ~3 TB/s memory bandwidth, that’s about 17 FLOPs/byte. Below 17, any further compute optimization is wasted; above 17, any further bandwidth optimization is wasted. Every well-tuned kernel pushes its arithmetic intensity firmly above the crossover. How depends on the operation. For matmul, the dominant tool is tiling.

每个 kernel 都坐落在 roofline 的某处:

  • 内存受限区。 吞吐受数据从 HBM 到达的速度限制。要更快,减少每次有用算子搬运的字节。
  • 算力受限区。 吞吐受 SM 每周期能做多少 FLOPs 限制。要更快,增加并行度或使用更高吞吐的计算单元(tensor cores)。

分界线是 kernel 的 算术强度:每加载一字节做多少 FLOPs。交叉点在 \((\text{峰值 FLOPS}) \,/\, (\text{峰值带宽})\)。H100 约 50 TFLOPS fp32 和约 3 TB/s 内存带宽,交叉点约 17 FLOPs/byte。低于 17 时,进一步的算力优化都是浪费;高于 17 时,进一步的带宽优化都是浪费。每个调好的 kernel 都把算术强度推到远超交叉点。怎么做 取决于运算。对矩阵乘法,主要工具是 tiling(分块)

Tile Size: The Key Knob

Instead of one output element per thread, organize threads into blocks that cooperatively compute a \(T \times T\) tile of \(C\). The block first loads a strip of \(A\) (T rows, K columns) and a strip of \(B\) (K rows, T columns) into shared memory — on-chip SRAM, roughly 100× faster than HBM. The threads in the block then compute the entire \(T \times T\) tile from shared memory only, reusing each loaded element \(T\) times.

For each \(T \times T\) output tile:

  • HBM bytes read: \(4 \cdot 2 T K = 8 T K\) (fp32)
  • FLOPs done: \(2 T^2 K\) (one multiply + one add per \(T \times T \times K\) outer-product)
  • Arithmetic intensity: \(T / 4\) FLOPs/byte

So doubling the tile size doubles the arithmetic intensity. Going from \(T = 1\) (the naive kernel) to \(T = 64\) takes you from 0.25 FLOPs/byte to 16 FLOPs/byte — right at the H100 crossover.

But \(T\) cannot grow without bound. Shared memory per block is only 48–228 KB, and each block needs \(4(T^2 + 2TK)\) bytes for the output tile and input strips. The register file and the total threads-per-block both impose further ceilings. If blocks become too big, you launch too few of them in parallel and lose occupancy — SMs sit idle for lack of work to schedule. So choosing \(T\) is a balancing act: large enough to be compute-bound, small enough to fit and to keep the SMs busy.

Try it yourself. The widget below models a simple square-tile matmul against representative GPU parameters (50 TFLOPS compute, 3 TB/s bandwidth, 48 KB shared memory per block). Slide \(T\) from 1 (the naive kernel) up through 128 (oversized) and watch the user point trace along the roofline.

The interesting band is roughly \(T \in [64, 96]\): large enough to be compute-bound, small enough to fit. Outside that band you’re leaving performance on the table. Real production kernels go much further — multi-level tiling (warp / thread / block), software pipelining, tensor-core packing, hand-tuned warp specialization, branch reduction, instruction-level scheduling — but the tile-size tradeoff is the first lever any optimizing kernel pulls.

不是一个线程算一个输出元素,而是把线程组织成 block,协作计算 \(C\) 的一个 \(T \times T\) tile。Block 先把 \(A\) 的一条带(T 行、K 列)和 \(B\) 的一条带(K 行、T 列)加载到 共享内存 ——片上 SRAM,比 HBM 快约 100 倍。然后 block 内的线程只用共享内存就能计算整个 \(T \times T\) tile,每个加载的元素被复用 \(T\) 次。

对每个 \(T \times T\) 输出 tile:

  • HBM 字节读取:\(4 \cdot 2 T K = 8 T K\)(fp32)
  • FLOPs 完成:\(2 T^2 K\)(每次 \(T \times T \times K\) 外积中一次乘加)
  • 算术强度:\(T / 4\) FLOPs/byte

所以 tile size 翻倍,算术强度翻倍。从 \(T = 1\)(naive kernel)到 \(T = 64\),算术强度从 0.25 提到 16 FLOPs/byte——恰好踩在 H100 的交叉点。

但 \(T\) 不能无限大。每 block 的共享内存只有 48–228 KB,每 block 需要 \(4(T^2 + 2TK)\) 字节存输出 tile 与输入带。寄存器文件与每 block 总线程数是进一步的上限。block 太大就并发不够多,丢掉 occupancy(占用率) ——SM 因为没活可调度而空闲。所以选 \(T\) 是平衡术:足够大达到算力受限,足够小能放下并保持 SM 忙。

自己玩玩看。下面这个 widget 模拟简单的方形 tile 矩阵乘法,参数对应代表性 GPU(50 TFLOPS 算力、3 TB/s 带宽、每 block 48 KB 共享内存)。把 \(T\) 从 1(naive kernel)滑到 128(过大),看用户点在 roofline 上的轨迹。

有意思的区间大约是 \(T \in [64, 96]\):足够大达到算力受限,足够小能放下。这之外性能在浪费。真实生产 kernel 走得远得多——多级 tiling(warp / thread / block)、软件流水、tensor-core 打包、手调 warp 专化、分支削减、指令级调度——但 tile-size 这个权衡是任何优化 kernel 拉的第一根杠杆。

Why TriMul Is Special (and What TTT-Discover Optimizes)

The TriMul kernel is matmul with a twist: one input is restricted to upper- or lower-triangular, so half its entries are known to be zero. This pattern is ubiquitous in transformer training — the causal attention mask makes \(Q K^\top\) effectively triangular — so a fast TriMul translates directly into pretraining speedups for every causal-LM stack.

A naive kernel doesn’t know about the zeros. It loads them, multiplies by them, adds them, and wastes half the FLOPs. A specialized kernel must:

  • Skip dead loads. Don’t fetch elements known to be zero — recover the lost bandwidth.
  • Compress storage. Pack the triangular matrix into half the space — doubles effective bandwidth again.
  • Branch-align tiles. Pick tile dimensions so each tile lies entirely above, below, or on the diagonal — never straddling, which forces warps to branch (all threads in a warp must take the same control path, so divergent branches serialize).
  • Pipeline across the triangle. Reorder block iteration so warp issue rates stay high in the “thin” rows/columns near the diagonal.

These optimizations interact. The tile size that maximizes arithmetic intensity may put tile boundaries on the wrong side of the diagonal. The compressed layout that saves bandwidth may force ugly strided access patterns. Picking the best combination is what a competitive GPU engineer does over weeks of profiling.

This is the design space TTT-Discover walks. The system inherits an LLM’s prior knowledge of CUDA/Triton idioms, and uses RL to specialize a kernel for a specific TriMul shape on a specific chip. Each chip generation has different tradeoffs — H100 has tensor cores and 228 KB shared memory per block, A100 has 164 KB, AMD MI300x has wave-level instead of warp-level execution, B200 has yet another set of changes — so the optimal kernel for each chip is a different small program. The headline result you’ll see in the TTT-Discover section — beating GPUMode champions on four chip generations with one recipe — is impressive precisely because the same RL loop found a different optimum for each chip, with no human re-tuning.

TriMul kernel 是带扭曲的矩阵乘法:一个输入限制为上三角或下三角,所以一半元素 已知 为零。这种模式在 transformer 训练中无处不在——因果 attention mask 让 \(Q K^\top\) 在效果上三角——所以快的 TriMul 直接转化为每个 causal LM 训练栈的预训练加速。

Naive kernel 不知道这些零。它照旧加载、乘、加,浪费一半 FLOPs。 特化 kernel 必须:

  • 跳过死加载。 不取已知为零的元素——把丢掉的带宽找回来。
  • 压缩存储。 把三角矩阵打到一半空间——等效带宽再翻一倍。
  • Tile 对齐分支。 选 tile 维度让每个 tile 完全 在对角线之上、之下或恰在对角线上——绝不 跨越,因为跨越会迫使 warp 分支(warp 内所有线程必须走同一控制路径,分歧分支会串行化)。
  • 沿三角形流水。 重排 block 迭代顺序,让靠近对角线的”细”行/列保持高 warp 发射率。

这些优化 相互影响。算术强度上最优的 tile size 可能让 tile 边界落在对角线错误一侧。省带宽的压缩布局可能强制丑陋的跨步访问。挑出最佳组合就是一位 GPU 工程师 profile 几周的活。

这就是 TTT-Discover 走的设计空间。系统继承 LLM 关于 CUDA/Triton 写法的先验,同时 用 RL 针对 特定 TriMul 形状在 特定 芯片上特化 kernel。每代芯片有不同的权衡——H100 有 tensor core、每 block 228 KB 共享内存,A100 有 164 KB,AMD MI300x 用 wave-level 而非 warp-level 执行,B200 又有另一组变化——所以每个芯片的最优 kernel 是 一个不同的小程序。你将在 TTT-Discover 节看到的标志性结果——用同一份配方在四代芯片的 GPUMode 上击败冠军——之所以惊人,正是因为同一份 RL 循环为每个芯片找到了不同的最优,无需人工重调。

TTT-Discover (2026): Training at Inference Time

TTT-Discover keeps everything AlphaEvolve does outside the LLM — the evaluator, the prompt-building, the buffer of past programs — and changes one thing inside it: at every step, the LLM’s weights are updated by reinforcement learning, using the evaluator’s scores as reward. The rest of this section walks through what that one change buys, how it’s implemented, and how it performs on four very different problem domains.

TTT-Discover 保留了 AlphaEvolve 在 LLM 外部 做的一切——evaluator、prompt 构建、过往程序的 buffer——只改了 LLM 内部 的一件事:每一步用 evaluator 的分数当奖励,通过强化学习更新 LLM 权重。本节其余部分讲清这一改动买到了什么、如何实现、在四个差异极大的问题领域上表现如何。

Test-Time Training in One Paragraph

Test-time training (TTT), introduced by Sun, Wang, Liu, Miller, Efros, Hardt at ICML 2020, takes a single test instance and runs a few gradient steps of a self-supervised loss on that instance before predicting. The thesis: each test instance defines its own learning problem. In 2020 the auxiliary loss was image-rotation prediction. In the 2024 sequence-modeling revival (TTT-Linear / TTT-MLP), it became an RNN inner state whose recurrence is itself a gradient step. TTT-Discover is the third act of the same idea: the “test instance” is now a scientific problem, the “auxiliary loss” is now an RL reward computed by the evaluator, and the “few gradient steps” become hundreds of RL updates on a single problem. The mechanism — train at test time, not just predict — is unchanged.

The single distinction from AlphaEvolve. AlphaEvolve: frozen LLM + evolutionary outer loop. TTT-Discover: the LLM’s weights are updated by RL using the evaluator’s scores as reward; the next batch of proposals is drawn from the updated model. Everything else is shared.

Test-time training (TTT) 由 Sun, Wang, Liu, Miller, Efros, Hardt 在 ICML 2020 提出:取单个测试实例,在其上做几步自监督 loss 的梯度,再做预测。核心论点:每个测试实例都定义自己的学习问题。2020 年的辅助 loss 是图像旋转预测;2024 年序列建模复兴版本(TTT-Linear / TTT-MLP) 中变成 RNN 内层状态,其递推本身就是一步梯度。TTT-Discover 是同一思想的第三幕:”测试实例”现在是 科学问题,”辅助 loss”现在是 evaluator 计算的 RL 奖励,”几步梯度”变成对单一问题数百步 RL 更新。机制——在测试时训练,而非只预测——没有改变。

与 AlphaEvolve 唯一的区别。 AlphaEvolve:冻结 LLM + 外层演化循环。TTT-Discover:用 evaluator 分数当奖励,通过 RL 更新 LLM 权重;下一批提议从更新后的模型采样。其余完全共享。

The Core Loop

For each problem \(P\) paired with an evaluator \(E\), TTT-Discover instantiates a fresh copy of a pretrained policy \(\pi_{\theta_0}\) (the paper uses gpt-oss-120b, OpenAI’s open-source 120B-parameter model) and iterates:

init:   policy π_θ = gpt-oss-120b
        problem P with evaluator E
        rollout buffer B = ∅

for step i = 0, 1, 2, ..., N:
    # 1. PROPOSE: sample candidates from current policy
    candidates = π_θ.sample(P, batch_size=k)
    # 2. SCORE: deterministic evaluator
    rewards = E(candidates)
    B.extend(zip(candidates, rewards))
    # 3. TRAIN: RL update on the rollouts
    θ ← rl_update(θ, B)      # ← the TTT step
    # 4. KEEP BEST so far
    best = max(best, max(rewards))

return best

Single-turn rollouts, multi-step training. Each call to π_θ.sample(...) produces a single-turn generation — the model receives a prompt (problem spec plus a few high-scoring past candidates as in-context examples) and emits one complete artifact in one shot: a Python program (math, AtCoder, biology) or a CUDA/Triton kernel (GPUMode). It does not see compiler errors, benchmark timings, or RNA-seq scores within a single rollout. The reward is computed once on the finished artifact and becomes the credit signal for the policy-gradient update on that whole output sequence. The “iteration” happens at the RL-step level — the next batch is drawn from a different policy that has absorbed the gradient signal from this batch — so the model gets progressively better at writing correct kernels first try across training steps, not across turns within one rollout. This contrasts with agentic systems like SWE-agent or Sakana’s AI CUDA Engineer, which keep the LLM frozen and instead wrap it in a multi-turn scaffold (write → compile → read error → fix → re-run → …). TTT-Discover’s bet: in domains with a fast deterministic verifier, updating the weights buys more than giving the model more turns at the same compute.

The substantive line is step 3. AlphaEvolve has steps 1, 2, 4 — and replaces step 3 with “stuff top-scoring candidates into the next prompt as in-context examples.” TTT-Discover instead does an actual RL gradient update, so that the distribution from which step 1 samples shifts toward higher-reward regions of program space as the search proceeds.

The empirical signature is visible in the reward distribution per step. The figure below, from the paper’s project page, plots the reward distribution at steps 0, 9, 24, and 49 on the GPUMode TriMul kernel competition. The whole distribution drifts upward over the course of the run — not just the maximum, but the median, the tails, the modes. The model is becoming progressively more skilled at this one problem.

TTT-Discover reward distribution evolving across training steps
Reward distribution at training steps 0, 9, 24, and 49 on the GPUMode TriMul competition (NVIDIA A100). As RL updates accumulate, the full distribution — not just the best sample — shifts toward higher reward. Source: TTT-Discover project page.

(For background on GPUMode and the TriMul kernel, see the GPU kernel warm-up above.)

对每个问题 \(P\) 配套 evaluator \(E\),TTT-Discover 实例化一个全新的预训练策略副本 \(\pi_{\theta_0}\)(论文使用 gpt-oss-120b,OpenAI 的 120B 开源模型),然后迭代:

init:   policy π_θ = gpt-oss-120b
        problem P with evaluator E
        rollout buffer B = ∅

for step i = 0, 1, 2, ..., N:
    # 1. PROPOSE: 从当前策略采样候选
    candidates = π_θ.sample(P, batch_size=k)
    # 2. SCORE: 确定性 evaluator
    rewards = E(candidates)
    B.extend(zip(candidates, rewards))
    # 3. TRAIN: 在 rollouts 上做 RL 更新
    θ ← rl_update(θ, B)      # ← TTT 步骤
    # 4. KEEP BEST so far
    best = max(best, max(rewards))

return best

Single-turn rollout,multi-step 训练。 每次调用 π_θ.sample(...) 都是 单回合 生成——模型收到 prompt(问题描述 + 几个高分历史候选作为 in-context 示例),一次性 输出一个完整工件:Python 程序(数学、AtCoder、生物学)或 CUDA/Triton kernel(GPUMode)。它在 单次 rollout 内 看不到 编译器报错、benchmark 计时、RNA-seq 分数。奖励在完成工件上一次性算出,作为该整段输出序列的策略梯度信号。”迭代”发生在 RL-步层面——下一批 候选来自 不同的 策略(一个吸收了本批梯度信号的策略),所以模型跨训练步变得越来越擅长 一次写对 kernel,而不是在一次 rollout 里来回 debug。这与 agentic 系统(如 SWE-agent、Sakana 的 AI CUDA Engineer)形成对照:它们让 LLM 冻结,外面包一层多回合脚手架(写 → 编译 → 读错 → 改 → 再跑 →…)。TTT-Discover 的赌注:在有快速确定性 verifier 的领域,同样算力下,更新权重比给模型更多回合更值

关键的是第 3 步。AlphaEvolve 有 1、2、4 步——把第 3 步换成”把得分最高的候选塞回下一次 prompt 当 in-context 示例”。TTT-Discover 则做真实的 RL 梯度更新,让 第 1 步采样所依据的分布 随搜索推进朝程序空间中高奖励区域漂移。

经验签名在每步的奖励分布上可见。下图来自论文项目页,绘制了 GPUMode TriMul kernel 竞赛上第 0、9、24、49 步的奖励分布。整条分布在运行过程中向上漂移——不仅是最大值,中位数、尾部、众数都在动。模型在 这一个 问题上变得越来越擅长。

TTT-Discover 奖励分布随训练步演化
GPUMode TriMul 竞赛(NVIDIA A100)上训练步 0、9、24、49 的奖励分布。RL 更新累积时,整条分布——而非仅最佳样本——向更高奖励漂移。来源:TTT-Discover 项目页

(GPUMode 与 TriMul kernel 的背景见上方 GPU kernel 热身节。)

Reward Design, Training Recipe, and Cost

The 2020 TTT paper used a self-supervised auxiliary loss because vision OOD has no labels at test time. TTT-Discover’s domain is different: every problem comes with an evaluator \(E: \mathrm{candidate} \to \mathbb{R}\) — did the program compile and run? did the proof check in Lean? does the kernel produce correct output, and how fast? Those scores are labels — imperfect (high score doesn’t imply optimum), but enough to drive RL.

Two design consequences follow.

The evaluator is the bottleneck. The method’s leverage comes entirely from how informative and inexpensive \(E\) is. Each step of TTT-Discover requires evaluating a batch of candidates; the inner loop is bounded by evaluator throughput. Math conjectures get this for free (closed-form scoring); GPU kernels need a correctness checker plus a timed benchmark; competitive programming needs hidden test cases; biology needs a held-out reference. The four case studies were chosen specifically because each has a fast, reliable, densely-informative scorer.

The base model has to be capable. A weak model produces samples whose average reward is too low for the gradient to find signal. gpt-oss-120b is large enough to occasionally hit high-reward candidates from cold-start prompting — RL then amplifies that occasional hit into a reliable mode of behavior. Trying TTT-Discover on top of a 7B model would likely fail not because the method is wrong but because RL needs something to reinforce.

Practical details that matter for reproducibility:

  • Backbone: gpt-oss-120b, open-source. One fresh fine-tunable copy per problem.
  • Optimizer & RL objective: standard policy-gradient setup with the rollout buffer as on-policy data; details follow recent open-source RL-for-LLM recipes.
  • Steps per problem: on the order of 50 RL updates (the figure shows steps 0, 9, 24, 49); each step processes a batch of new rollouts.
  • Sampling at each step: thousands of candidates per step in the engineering domains; fewer when each evaluation is expensive (biology).
  • Wall-clock per problem: hours to a day on commodity GPU clusters.
  • Dollar cost per problem: “a few hundred dollars” per the paper — an order of magnitude below the implied AlphaEvolve per-problem cost.

The cost asymmetry comes from two places. First, gpt-oss-120b on rented GPUs is much cheaper per token than frontier-API access to a model like Gemini 2.5 Pro. Second, RL is sample-efficient compared to evolutionary search — a gradient step uses every sample in the buffer; an evolutionary step uses only the survivors. The same compute spend extracts more signal.

2020 年 TTT 论文用自监督辅助 loss,因为视觉 OOD 测试时没有标签。TTT-Discover 的场景不同:每个问题自带 evaluator \(E: \mathrm{candidate} \to \mathbb{R}\)——程序能编译运行吗?证明在 Lean 里通过吗?kernel 输出正确吗、跑多快?这些分数 就是 标签——不完美(高分不等于最优),但足以驱动 RL。

两个设计推论。

Evaluator 是瓶颈。 方法的全部杠杆都依赖 \(E\) 的信息密度与廉价度。TTT-Discover 每一步都要评估一批候选,内层循环被 evaluator 吞吐量卡住。数学猜想免费具备(闭式打分);GPU kernel 需要正确性检查器加计时基准;竞赛编程需要隐藏测试集;生物学需要留出参考。四个案例之所以被选中,正是因为它们都有又快又可靠又信息密集的 scorer。

基础模型必须强。 弱模型产生的样本平均奖励太低,梯度找不到信号。gpt-oss-120b 大到能从冷启 prompt 偶尔击中高奖励候选——RL 然后把偶尔的命中放大成可靠的行为模式。在 7B 模型上跑 TTT-Discover 大概率失败,不是因为方法不对,而是因为 RL 需要 点东西 可以强化。

可复现性相关的实践细节:

  • Backbone: gpt-oss-120b,开源。每个问题一份全新的可微调副本。
  • 优化器与 RL 目标: 标准 policy-gradient 设置,rollout buffer 作为 on-policy 数据;细节沿用近期开源 RL-for-LLM 配方。
  • 每问题步数: 约 50 步 RL 更新(图中显示 0、9、24、49 步);每步处理一批新 rollouts。
  • 每步采样: 工程领域每步数千候选;评估昂贵时(生物学)更少。
  • 每问题 wall-clock: 通用 GPU 集群上数小时到一天。
  • 每问题美元成本: 论文称”几百美元”——比 AlphaEvolve 暗示的每问题成本低一个数量级。

成本差距来自两个地方。其一,租用 GPU 上跑 gpt-oss-120b 比前沿 API(如 Gemini 2.5 Pro)每 token 便宜得多。其二,RL 比演化搜索 sample-efficient——梯度步使用 buffer 里每个样本,演化步只使用幸存者。同等算力榨出更多信号。

Math Results: Erdős & Autocorrelation

Who was Erdős, and what's the minimum overlap problem? (Click to expand)

Paul Erdős (1913–1996) was a Hungarian mathematician famous for his extraordinarily prolific output — over 1500 published papers with 500+ collaborators — and for the habit of posing open problems with small cash prizes attached to their solutions. The minimum overlap problem is one of his characteristic combinatorial questions, posed in 1955.

Informally: partition the integers from 1 to 2n into two equal-sized sets A and B. Now slide one of them — consider B shifted by every integer k. The two sets B+k and A will overlap in some number of points. The minimum overlap problem asks: how small can you make the largest such overlap, by choosing A and B cleverly? Formally, define M(n) = min over partitions of max over shifts k of |A ∩ (B + k)|. Erdős conjectured that the limiting density c = limn→∞ M(n)/n is strictly bounded away from 0 and from 1/4.

The problem is hard because it's extremal: even the most evenly-spread partitions have a "bad shift" that pushes the overlap up. No closed form for c is known; every published bound corresponds to an explicit numerical construction. Improving the bound is a long-running goalpost in additive combinatorics.

For a partition of \(\{1, \dots, 2n\}\) into two equal-sized sets \(A\) and \(B\), define \(M(n) = \min_A \max_k \vert A \cap (B + k) \vert\) — the smallest possible peak overlap as you slide \(B\) over \(A\). The Erdős minimum overlap conjecture asks for the value of \(c = \lim_n M(n)/n\). Improving either bound on \(c\) has been an open problem for decades; AlphaEvolve made the first numerical progress since 2016, and TTT-Discover advances it further.

Method Year Best bound on \(c\)
Last published human result 2016 0.380927
AlphaEvolve 2025 0.380924
TTT-Discover 2026 0.380876

The TTT-Discover number drops the bound by about \(5 \times 10^{-5}\) below AlphaEvolve — small in absolute terms but the kind of move that takes a competition-grade specialist a year of work. Both numbers correspond to explicit constructions (partitions of \(\{1, \dots, 2n\}\) found by the search), so the result is verifiable.

Side-by-side mathematical constructions for the Erdős minimum overlap problem
Constructions for the Erdős minimum overlap problem. Source: TTT-Discover project page.

The autocorrelation inequality is a sibling problem: extremal autocorrelations of indicator functions of subsets of integers. AlphaEvolve had previously pushed it; TTT-Discover takes the same problem and improves further (concrete numbers are reported alongside the Erdős result in the paper’s tables). The methodological point of the math results is not that TTT-Discover solves either conjecture — neither is solved — but that on shared problems and shared compute budgets, the test-time-training lever beats the evolutionary-search lever. That comparison is cleaner than comparing a TTT system against a different method on different problems.

Erdős 是谁?minimum overlap 问题是什么?(点击展开)

Paul Erdős(1913–1996)是匈牙利数学家,以惊人产出闻名——1500 多篇发表论文、500+ 合作者——并习惯以小额现金悬赏的方式提出开放问题。Minimum overlap 问题是他典型的组合学问题之一,于 1955 年提出。

直观地说:把整数 1 到 2n 切成两等分 A 与 B。然后滑动其中之一——考虑 B 平移任意整数 k 后的 B+k。两集合 B+k 与 A 会在若干点上重叠。Minimum overlap 问题问:通过聪明地选择 A 和 B,最的这种重叠能压到多小?形式化定义 M(n) = 所有划分上的最小值的、所有平移 k 上的最大值 |A ∩ (B + k)|。Erdős 猜想极限密度 c = limn→∞ M(n)/n 严格被 0 与 1/4 区间所夹。

问题难是因为它是极值的:再均匀的划分也存在"坏平移"把重叠推上去。c 没有已知闭式;每个公开的界都对应具体的数值构造。改进界是加性组合学的一个长期目标。

对 \(\{1, \dots, 2n\}\) 的两等分 \(A, B\),定义 \(M(n) = \min_A \max_k \vert A \cap (B + k) \vert\)——把 \(B\) 在 \(A\) 上滑动时最小可能的峰值重叠。Erdős minimum overlap 猜想询问 \(c = \lim_n M(n)/n\) 的值。改进 \(c\) 的任一界是几十年悬而未决的问题;AlphaEvolve 是 2016 年以来首次数值进展,TTT-Discover 把它推得更远。

方法 年份 \(c\) 的最佳界
上次公开人类结果 2016 0.380927
AlphaEvolve 2025 0.380924
TTT-Discover 2026 0.380876

TTT-Discover 把界相对 AlphaEvolve 又压低了约 \(5 \times 10^{-5}\)——绝对值小,但对竞赛级专家而言相当于一年工作量的进步。两个数字都对应具体构造(搜索找到的 \(\{1, \dots, 2n\}\) 划分),所以结果可被验证。

Erdős minimum overlap 问题的并排数学构造
Erdős minimum overlap 问题的构造。来源:TTT-Discover 项目页

自相关不等式 是同类问题:整数子集示性函数的极值自相关。AlphaEvolve 此前推进过,TTT-Discover 在同一问题上继续推进(具体数字在论文表格中与 Erdős 结果并列)。数学结果的方法论要点 不是 TTT-Discover 解决了某个猜想——两者都未解决——而是在共享问题与共享计算预算下,test-time training 这根杠杆胜过演化搜索这根杠杆。这种对比比拿 TTT 系统在不同问题上对照不同方法干净得多。

Engineering: GPU Kernels and AtCoder

The most operationally impressive result is on the GPUMode TriMul kernel competition: write a CUDA/Triton kernel that computes a particular triangular matrix multiplication, matching reference correctness and beating the human leaderboard on wall-clock time. The reward signal is straightforward: correctness gate × time speedup. The same TTT-Discover run was repeated on four different accelerator generations:

Hardware Best human (μs) TTT-Discover (μs) Speedup
NVIDIA A100 4,531 2,198 2.06×
AMD MI300x 2,462 1,596 1.54×
NVIDIA H100 1,371 1,161 1.18×
NVIDIA B200 1,005 905 1.11×

Several things to read out of this table. First, the absolute speedup is larger on older / slower chips (A100, MI300x) and shrinks on newer ones (H100, B200) — the older chips leave more room for optimization that human contestants didn’t fully exhaust, while newer chips’ theoretical peak is closer to the human submissions’ wall-clock. Second, the cross-hardware consistency — every chip beaten, including AMD — suggests the method isn’t overfitting to one vendor’s quirks. Third, the human “best” entries are the result of organized public-competition work, not casual baselines, which makes the comparison meaningful.

What's AtCoder, and how does a Heuristic Contest differ from an algorithm contest? (Click to expand)

AtCoder is a Japanese competitive programming platform, comparable in scale and prestige to Codeforces. It runs two main contest formats with very different rules.

Algorithm contests are short sessions (2–5 hours) with problems graded pass / fail — your solution either matches the expected output on every hidden test case or it does not. They reward correctness and asymptotic efficiency.

Heuristic contests (AHC) are different. They run for one to two weeks, problems are NP-hard or otherwise admit no exact polynomial-time solution, and your score is a continuous value — e.g., total cost of a schedule, number of items packed, area covered. Top submissions are the product of dozens of incremental refinements over the contest window by an experienced competitive programmer. Beating the leader by even a small margin is therefore non-trivial: it requires inventing or implementing a heuristic that the field's best didn't try.

AtCoder Heuristic Contests (AHC) are long-form competitive programming problems where the score is continuous (not pass/fail) and the goal is to maximize it — closer in flavor to “construct a good solution” than to “find the algorithm.” TTT-Discover was run on past AHC contests with the contest’s own scoring used as reward.

Contest Domain Best human score TTT-Discover score
AHC39 Geometry 566,997 567,062
AHC58 Scheduling 847,674,723 848,414,228

The margins are small but above the best human, which on contests of this scale means the top of several hundred competitive programmers, often with multiple submissions per contestant. Same method, same model, two very different domains.

操作上最令人印象深刻的结果是 GPUMode TriMul kernel 竞赛:写一个 CUDA/Triton kernel 计算特定三角矩阵乘法,正确性匹配参考实现,wall-clock 时间击败人类 leaderboard。奖励信号简单:正确性 gate × 时间加速比。同一份 TTT-Discover 运行在四代不同加速器上重复:

硬件 人类最佳 (μs) TTT-Discover (μs) 加速比
NVIDIA A100 4,531 2,198 2.06×
AMD MI300x 2,462 1,596 1.54×
NVIDIA H100 1,371 1,161 1.18×
NVIDIA B200 1,005 905 1.11×

这张表能读出几件事。其一,绝对加速比在 较老 / 较慢 芯片上更大(A100, MI300x),在较新芯片上缩小(H100, B200)——较老芯片留给优化的空间更大,人类选手没榨干;较新芯片的理论峰值更接近人类提交的 wall-clock。其二,跨硬件一致性——每个芯片都被击败,包括 AMD——说明方法不是过拟合到单一供应商的怪癖。其三,”人类最佳”条目来自有组织的公开竞赛工作,不是随手 baseline,所以这个对比是有意义的。

什么是 AtCoder?Heuristic Contest 与 Algorithm Contest 有何不同?(点击展开)

AtCoder 是日本竞争性编程平台,规模与声望可比 Codeforces。它运行两种规则差异显著的比赛形式。

Algorithm contests 是短场(2–5 小时),题目按通过/失败评判——你的解要么在每个隐藏测试上匹配预期输出,要么不匹配。它们奖励正确性与渐进效率。

Heuristic contestsAHC)不同。它们持续一到两周,题目是 NP-hard 或不存在精确多项式解的问题,分数是 连续 值——如调度总成本、装箱数量、覆盖面积。顶级提交是经验丰富的竞争性程序员在比赛周期内数十次渐进精化的产物。所以哪怕以微小幅度击败领先者都不容易:需要发明或实现一个领域最强者没试过的启发式方法。

AtCoder Heuristic Contests (AHC) 是长篇竞争性编程题,分数连续(不是通过/失败),目标是最大化分数——更接近”构造一个好解”而非”找算法”。TTT-Discover 在过往 AHC 比赛上跑,用比赛自带的打分作为奖励。

比赛 领域 人类最高分 TTT-Discover 得分
AHC39 几何 566,997 567,062
AHC58 调度 847,674,723 848,414,228

分差不大但 高于人类最佳,这种规模的比赛意味着击败了数百竞争性程序员的最高水平(每位选手通常多次提交)。同一方法,同一模型,两个完全不同的领域。

Biology: Single-Cell RNA Denoising

Single-cell RNA-seq, PBMC, Tabula, and the classical denoisers. (Click to expand)

Single-cell RNA sequencing (scRNA-seq) measures mRNA abundance for every gene in every individual cell. The output is a (cells × genes) count matrix where entry (i, j) is approximately the number of mRNA molecules from gene j detected in cell i. Typical experiments produce 10,000 to 100,000 cells by ~20,000 genes. The matrix is brutally sparse — each cell only has a few thousand non-zero entries — and is contaminated by technical artifacts (uneven capture efficiency, amplification bias, ambient mRNA from lysed cells).

Downstream analyses (cell-type clustering, trajectory inference, marker-gene discovery) are sensitive to that noise, so denoising is a critical preprocessing step. The widely-cited classical denoisers are MAGIC (Markov affinity-based graph imputation), DCA (deep count autoencoder), and scVI (single-cell variational inference); each trains a dedicated model on the dataset.

PBMC = peripheral blood mononuclear cells, a benchmark dataset of immune cells from blood. Tabula refers to the Tabula Sapiens / Tabula Muris cell atlases — large multi-organ scRNA-seq references widely used to evaluate denoising and integration methods.

The biology case study pushes the method farthest from its comfort zone. Single-cell RNA sequencing produces sparse, noisy gene-expression counts; classical denoising methods (MAGIC, DCA, scVI) require training a domain-specific model on the dataset. TTT-Discover instead asks gpt-oss-120b to write a denoising program — a Python script that takes the raw count matrix and emits a denoised one — and uses Pearson correlation against a held-out reference as the reward signal.

Dataset Best human method TTT-Discover
PBMC 0.64 0.71
Tabula 0.64 0.73

Two-to-nine points of Pearson correlation is a meaningful margin on these benchmarks. The interesting design choice is what TTT-Discover outputs: a program, not a denoised matrix. The model is trained to be good at writing denoisers for this dataset, not at denoising itself. That inheritance from FunSearch and AlphaEvolve — programs as the unit of discovery — keeps the artifact interpretable: a biologist can read the generated Python and slot it into a real pipeline. A model that outputted a matrix directly would be hard to trust the same way.

单细胞 RNA-seq、PBMC、Tabula 与经典去噪方法。(点击展开)

单细胞 RNA 测序(scRNA-seq)测量每个独立细胞中每个基因的 mRNA 丰度。输出是 (cells × genes) 计数矩阵,元素 (i, j) 约等于细胞 i 中基因 j 的 mRNA 分子数。典型实验产生 10,000 到 100,000 个细胞 × 约 20,000 个基因。矩阵极度稀疏——每个细胞只有几千个非零项——并被技术伪影污染(捕获效率不一、扩增偏差、来自破裂细胞的环境 mRNA)。

下游分析(细胞类型聚类、轨迹推断、标记基因发现)对该噪声敏感,所以 去噪 是关键预处理步骤。广为引用的经典去噪方法包括 MAGIC(基于 Markov 亲和图的插补)、DCA(深度计数自编码器)、scVI(单细胞变分推断);每种都在数据集上训练专用模型。

PBMC = 外周血单核细胞,来自血液的免疫细胞,基准数据集。Tabula 指 Tabula Sapiens / Tabula Muris 细胞图谱——多器官 scRNA-seq 大型参考数据,广泛用于评估去噪与整合方法。

生物学案例把方法推得离舒适区最远。单细胞 RNA 测序产生稀疏、嘈杂的基因表达计数;经典去噪方法(MAGIC、DCA、scVI)需要在数据集上训练专用模型。TTT-Discover 反过来要求 gpt-oss-120b 写一段去噪程序——一段 Python 脚本,接收原始计数矩阵,输出去噪后的矩阵——并用针对留出参考的 Pearson 相关性作为奖励信号。

数据集 人类最佳方法 TTT-Discover
PBMC 0.64 0.71
Tabula 0.64 0.73

在这些 benchmark 上 2–9 个 Pearson 相关性点是有意义的差距。有趣的设计选择是 TTT-Discover 输出的内容:一段程序,不是去噪后的矩阵。模型被训练得擅长 为这个数据集写去噪器,而不是擅长去噪本身。这个继承自 FunSearch 与 AlphaEvolve 的选择——把程序作为发现的单元——让产物保持可解释性:生物学家可以读生成的 Python 并嵌入真实管线。直接输出矩阵的模型很难以同样的方式信任。

Where This Sits, and What Comes Next

Place the three milestones on a 2×2:

  Single-domain (bespoke) Open-domain (LLM-based)
Frozen at inference Pre-2020 supervised ML FunSearch (2023), AlphaEvolve (2025), o1/R1-style scaling
Updated at inference AlphaProof’s contest-time RL (2024, narrow) TTT-Discover (2026)

(For reference: AlphaProof is DeepMind’s Lean-based theorem-proving system that earned a silver-medal score at IMO 2024 by running its RL training loop during the contest itself — the one prior public example of inference-time weight updates for scientific discovery, but only in the narrow Lean-proof domain.)

TTT-Discover claims the previously empty bottom-right cell: a general-purpose discovery system whose LLM is trained on whatever single problem you point it at, using off-the-shelf open weights as the cold start. The math, engineering, and biology results are the proof that this quadrant is not just theoretically interesting but currently the strongest publicly-reported configuration on the problems where direct comparisons exist.

Four threads worth flagging, two practical and two conceptual.

Practical. Does cost stay flat as problems get harder? The reported few-hundred-dollar figure is presumably averaged over problems where RL did successfully climb the reward landscape. Frontier-difficulty problems may cost much more, or fail to converge at all. Can the trained per-problem weights be reused? The paper treats each problem in isolation by design, but the specialized weights are a real asset — a “fine-tuned for kernels” model could plausibly cold-start later kernel runs faster.

Conceptual. Where does the prior come from? TTT-Discover’s RL does not invent new mathematical machinery; it amplifies structural priors already inside gpt-oss-120b’s pretraining. As problems drift further from anything in the pretraining mix, the leverage RL provides over the base model presumably shrinks. Is “test-time training” still the right framing? When you train for hundreds of GPU-hours on a single problem, the line between “test-time” and “training-time” has blurred almost beyond recognition. The deeper claim is that the unit of training has shifted from a corpus to a problem — a thesis Yu Sun’s group has been building toward since 2020, and the strongest test of which is whether the next paper extends the trajectory.

把三个里程碑放在一个 2×2 上:

  单领域(定制) 开放领域(LLM 基础)
推理时冻结 2020 年前的监督 ML FunSearch (2023), AlphaEvolve (2025), o1/R1 风格 scaling
推理时更新 AlphaProof 比赛时 RL (2024, 狭窄) TTT-Discover (2026)

(背景注:AlphaProof 是 DeepMind 基于 Lean 的定理证明系统,在 IMO 2024 中通过比赛 期间 运行 RL 训练循环获得了银牌级分数——是科学发现领域唯一已公开的推理时权重更新的先例,但仅限于狭窄的 Lean 证明领域。)

TTT-Discover 占据之前空缺的右下格:一个通用发现系统,LLM 在你指向的任何单个问题上 被训练,以现成的开源权重作冷启动。数学、工程、生物学结果是该象限不仅理论上有趣、而且 在能做直接对比的问题上目前是公开报告的最强配置 的证据。

四条线索值得提出,两条实践、两条概念。

实践。 成本随问题难度而平稳吗? 报告的几百美元数字大概是 RL 成功爬上奖励地形的问题的平均值。前沿难度问题可能贵得多,或根本不收敛。训出的单问题权重能复用吗? 论文按设计孤立处理每个问题,但特化权重是真实资产——一个”为 kernel 微调过”的模型合理推测可以让之后的 kernel 任务冷启动更快。

概念。 先验从哪来? TTT-Discover 的 RL 不发明新数学机器,它放大已存在于 gpt-oss-120b 预训练中的结构先验。当问题离预训练混合越来越远,RL 在基础模型上提供的杠杆推测会缩水。“test-time training” 还是合适的提法吗? 一旦你在单个问题上训练数百 GPU 小时,”测试时”与”训练时”的界限几乎模糊到不可辨认。更深的主张是 训练单位从语料库迁移到了问题——Yu Sun 团队从 2020 年起持续推进的论点,对它最强的检验是下一篇论文是否延续这一轨迹。

Beyond the Four Domains: The Wider Benchmark Landscape

TTT-Discover targets four problems by design: an Erdős extremal-combinatorics conjecture, the GPUMode TriMul kernel competition, AtCoder Heuristic Contests, and single-cell RNA denoising on OpenProblems. Each was chosen for a fast deterministic verifier and a meaningful human baseline. Over the last 18 months, the LLM-agent community has built out a much larger benchmark ecosystem with the same shape — and any successor work in this line will be measured against the ecosystem rather than the hand-picked four. This section sketches the map.

TTT-Discover 按设计选了四个问题:Erdős 极值组合学猜想、GPUMode TriMul kernel 竞赛、AtCoder Heuristic Contests、OpenProblems 单细胞 RNA 去噪。每个都因为有快速确定性 verifier 和有意义的人类基线被选中。过去 18 个月里,LLM-agent 社区已经在同样形态周围搭出一个大得多的 benchmark 生态——这条线的任何继任工作都会被对照这个生态衡量,而不是它精选的四个。本节勾画这张地图。

A Four-Criteria Filter for "AI Scientist" Benchmarks

Across the recent literature, benchmarks credibly framed as “AI scientist” or “research agent” evaluations tend to satisfy four properties:

  1. Open-ended. The agent picks among many viable solution paths; there is no single canonical answer to copy from training data.
  2. Ideation matters. Performance is bounded by what non-obvious approach the agent tries, not just execution speed. Hyperparameter tuning alone shouldn’t win.
  3. Multi-step environment. The agent interacts with a runnable artifact — file system, compiler, simulator, dataset. Reward comes from program execution, not string matching.
  4. Non-trivial rollout horizon. Each evaluation takes minutes to hours, which is where iterative search and learned approximate scorers start to pay off.

Several widely-cited benchmarks fail at least one criterion and are therefore measuring something different. SWE-bench Verified problems mostly have a single hidden correct patch (criterion 1 fails for most instances). FrontierMath and PutnamBench score in milliseconds against a single numerical or symbolic answer (criterion 4 fails). BrowseComp asks for a single short string (criterion 3 fails). These are not bad benchmarks — they just measure other things (closed-ended program repair, math-answer recall, web retrieval).

近期文献里能名副其实被称为”AI scientist”或”研究 agent”评估的 benchmark,大体都满足四个性质:

  1. 开放式。 Agent 在多种可行解径间选择;没有可以从训练数据复制的唯一标准答案。
  2. 创意重要。 性能受 非显然方法 的限制,不是仅受执行速度限制。仅靠超参调优不应取胜。
  3. 多步环境。 Agent 与可运行工件交互——文件系统、编译器、模拟器、数据集。奖励来自程序执行,而非字符串匹配。
  4. 非平凡 rollout 时长。 单次评估耗时分钟到小时,这是迭代搜索与近似 scorer 开始划算的区间。

若干常被引用的 benchmark 至少违反一条因此在测量别的东西。SWE-bench Verified 问题大多有一个隐藏的正确补丁(绝大多数实例违反 1)。FrontierMathPutnamBench 毫秒级针对单一数值或符号答案评分(违反 4)。BrowseComp 要求单一短字符串(违反 3)。这些不是糟糕的 benchmark——它们只是在测量别的东西(封闭式程序修复、数学答案回忆、网络检索)。

The Stratified Stack

The benchmarks that do pass the filter organize, somewhat naturally, into tiers of increasing research demand:

Tier What the agent does Representative benchmark Year
ML engineering Win a Kaggle competition MLE-bench (75 Kaggle competitions, OpenAI) 2024
Research engineering Beat human ML-RE workflows RE-Bench (7 tasks, 71 human-expert attempts, METR) 2024
Methodological innovation Beat baselines via new methods MLRC-Bench (7 recent ML-conference contests) 2025
Long-horizon optimization Hill-climb a scoring function for days ALE-Bench (AtCoder Heuristic Contests, Sakana) 2025
GPU kernel optimization Write a CUDA / Triton kernel that beats reference KernelBench (Stanford, 250 PyTorch tasks) 2024
Full paper reproduction Build a paper end-to-end PaperBench (20 ICML 2024 papers, OpenAI) 2025
End-to-end paper generation Pick topic, execute, write AI-Scientist v2 (Sakana, open-ended) 2025

Each tier raises the bar from “given a goal, can you reach it?” toward “can you set your own goal and reach it?” Current frontier LLMs perform passably at the engineering end and progressively worse moving up — o1-preview + AIDE reaches Kaggle bronze in 16.9% of MLE-bench competitions; the best agent on MLRC-Bench closes only 9.3% of the gap between baseline and top human. Performance has not yet saturated at any tier.

A second family sits one level deeper, more useful for training agents than for headline leaderboards: MLAgentBench (Stanford / Snap, 13 ML tasks), MLE-Dojo (a gym-style trainable extension of the MLE-bench distribution with explicit SFT/RL support), MLGym (Meta + NYU, 13 open-ended tasks across CV / NLP / RL / game theory), and DSBench (466 data analysis + 74 data modeling tasks from ModelOff + Kaggle).

The figure below plots ~40 of these benchmarks on a 2D map — domain on the x-axis, research intensity on the y-axis. Each marker is a benchmark; color encodes year. Hover for a one-line description; click to open the paper or repo.

通过 这层过滤的 benchmark 大体上自然地按研究强度分成几层:

层级 Agent 做什么 代表 benchmark 年份
ML 工程 赢一场 Kaggle 比赛 MLE-bench(75 场 Kaggle 比赛,OpenAI) 2024
研究工程 击败人类 ML-RE 工作流 RE-Bench(7 任务,71 人类专家次尝试,METR) 2024
方法论创新 用新方法击败基线 MLRC-Bench(7 个近期 ML 会议比赛) 2025
长时程优化 在评分函数上爬山几天 ALE-Bench(AtCoder Heuristic Contests,Sakana) 2025
GPU kernel 优化 写一个超过参考实现的 CUDA / Triton kernel KernelBench(Stanford,250 PyTorch 任务) 2024
完整论文复现 端到端构建一篇论文 PaperBench(20 篇 ICML 2024 论文,OpenAI) 2025
端到端论文生成 自选题、执行、写作 AI-Scientist v2(Sakana,开放式) 2025

每一层把门槛从”给定目标能否达到”推向”能否 自己定 目标并达到”。当前前沿 LLM 在工程端勉强能用,越往上越差——o1-preview + AIDE 在 MLE-bench 上 16.9% 拿铜牌;MLRC-Bench 上最佳 agent 只关上了基线与人类最高水平之差的 9.3%。任何一层都尚未饱和。

第二族深一层,更适合 训练 agent 而非头条 leaderboard:MLAgentBench(Stanford / Snap,13 个 ML 任务)、MLE-Dojo(MLE-bench 分布的 gym 式可训练扩展,显式支持 SFT/RL)、MLGym(Meta + NYU,13 个跨 CV / NLP / RL / 博弈论的开放任务)、DSBench(466 个数据分析 + 74 个数据建模任务,来源 ModelOff + Kaggle)。

下图把约 40 个 benchmark 放在 2D 地图上——x 轴是研究领域,y 轴是研究强度。每个标记是一个 benchmark;颜色编码年份。悬停看一句话描述;点击打开论文或仓库。

ML & Research Engineering — Deep Dive

The most populated tier — every team has reproducible ML workflows to evaluate, so this is where benchmarking converged first.

  • MLE-bench (OpenAI, arXiv 2410.07095, Oct 2024) — 75 Kaggle competitions across vision, NLP, tabular, signal, biology. Agents train models, prepare datasets, submit predictions; graded against Kaggle’s actual leaderboard medal thresholds. Best published result: o1-preview + AIDE scaffolding reaches Kaggle bronze in 16.9% of competitions. Rollouts span hours to dozens of hours per competition. Note: OpenAI paused new leaderboard submissions on 24-Apr-2026 pending a v2 fairness overhaul; published o1-preview / AIDE numbers and the AIRA Toledo et al. follow-ups remain the comparable anchors. Code: github.com/openai/mle-bench.

  • RE-Bench (METR, arXiv 2411.15114, Nov 2024, ICML 2025) — 7 ML research-engineering environments: scaling-law fitting, GPU kernel optimization, LLM training task search, restricted MLP architecture search, etc. Designed with feedback from frontier-lab research engineers. 71 8-hour attempts by 61 distinct human experts open-sourced; agent transcripts at transcripts.metr.org. One AI agent wrote a faster custom Triton kernel than any of the human experts. 2h / 8h / 32h time-budget protocol. Among the richest trajectory datasets available. Authors request submitters keep solutions out of training data. Code: github.com/METR/RE-Bench.

  • PaperBench (OpenAI, ICML 2025) — reproduce 20 ICML 2024 Spotlight/Oral papers from scratch: codebase + experiments + result matching. 8,316 individually graded sub-tasks via author-co-developed rubrics + LLM judge (SimpleJudge). ML PhDs achieve 41.4%; Claude 3.5 Sonnet New reaches 21.0%. Full PaperBench costs roughly $4,000 USD per model on average; PaperBench Code-Dev variant is cheaper and GPU-free (~$2,000), though the paper notes Code-Dev correlates only “weakly” with full PaperBench. Code: github.com/openai/preparedness.

  • MLAgentBench (Stanford / Snap, arXiv 2310.03302, NeurIPS 2024) — 13 ML experimentation tasks: CIFAR-10 improvement, BabyLM, Kaggle challenges. ReAct-style scaffold with file ops, code execution, output inspection. Best agent (Claude 3 Opus) hits 37.5% average success. Foundational predecessor to MLE-bench / MLRC-Bench. Some tasks (CIFAR-10) now saturated. Code: github.com/snap-stanford/MLAgentBench.

  • MLRC-Bench (arXiv 2504.09702, Apr 2025, NeurIPS 2025) — 7 ML research competition tasks from recent ML conferences (LLM merging, machine unlearning, perception, few-shot). Explicitly designed to require novel methodological innovation beyond hyperparameter tuning. Best agent (gemini-exp-1206 under MLAB scaffold) closes only 9.3% of the gap between baseline and top human — currently the strongest measure of how far frontier LLMs are from doing methodological research. Designed as a “dynamic” benchmark that incorporates new competitions. Code: github.com/yunx-z/MLRC-Bench; HF Spaces leaderboard.

  • MLE-Dojo (arXiv 2505.07782, May 2025) — Gym-style interactive environment built on 200+ Kaggle challenges with a dedicated training set of 150 tasks. Explicitly supports SFT and RL — the trainable counterpart to MLE-bench. Sampling trajectories provided for both. Heavy overlap with MLE-bench’s Kaggle distribution, so most teams pick one or the other depending on whether they need trainable RL trajectories (MLE-Dojo) or industry-standard leaderboard comparability (MLE-bench). Code: github.com/MLE-Dojo/MLE-Dojo.

  • MLGym / MLGym-Bench (Meta + NYU, arXiv 2502.14499, Feb 2025) — Gym-style ML research environment with 13 open-ended tasks spanning CV, NLP, RL, game theory. Baselines from Claude 3.5 Sonnet, Llama-3.1-405B, GPT-4o, o1-preview, Gemini-1.5 Pro. CC-BY-NC 4.0 license — limits commercial deployment. Authors note current frontier models mostly tune hyperparameters in practice rather than ideating. Code: github.com/facebookresearch/MLGym.

  • AI-Scientist v1 & v2 (Sakana, v1 arXiv 2408.06292, 2024; v2 arXiv 2504.08066, 2025). v1 generates end-to-end ML papers on nanoGPT, 2D diffusion, grokking templates. v2 removes templates and uses progressive agentic tree search; one v2-generated paper was accepted at the ICBINB workshop at ICLR 2025 (then voluntarily withdrawn per a pre-agreement). Per ICBINB organizers’ ICLR 2026 proposal, past editions accepted roughly 35 of 40 submissions (~87.5%), far above main-conference rates — treat the “first AI-generated peer-reviewed paper” headline accordingly. ~$15–25 per full paper. Caveat: this is an agent system, not a benchmark with standardized scoring; evaluation requires LLM-as-judge or human review. Code: github.com/SakanaAI/AI-Scientist (~13.8k stars) and github.com/SakanaAI/AI-Scientist-v2 (~6.4k stars).

  • DSBench (ICLR 2025, arXiv 2409.07703) — 466 data-analysis tasks + 74 data-modeling tasks from ModelOff and Kaggle. Long contexts, multimodal data, multi-table reasoning. Best agent solves 34.12% of analysis tasks. More “analyst” than “scientist,” but a useful adjacent reference. Code: github.com/LiqiangJing/DSBench.

最密集的一层——每个团队都有可复现 ML 工作流可评估,所以 benchmarking 先在这里收敛。

  • MLE-bench(OpenAI,arXiv 2410.07095,2024 年 10 月)—— 75 场 Kaggle 比赛,覆盖视觉、NLP、表格、信号、生物学。Agent 训练模型、准备数据、提交预测;对照 Kaggle 实际 leaderboard 奖牌阈值评分。最佳公开结果:o1-preview + AIDE 脚手架在 16.9% 的比赛中拿到 Kaggle 铜牌。每场比赛 rollout 时长几小时到几十小时。注:OpenAI 在 2026 年 4 月 24 日暂停了新 leaderboard 投稿,等待 v2 公平性改进;已公开的 o1-preview / AIDE 数字与 AIRA Toledo 等的后续仍是可比的锚点。代码:github.com/openai/mle-bench

  • RE-Bench(METR,arXiv 2411.15114,2024 年 11 月,ICML 2025)—— 7 个 ML 研究工程环境:scaling law 拟合、GPU kernel 优化、LLM 训练任务搜索、受限 MLP 架构搜索等。在前沿实验室研究工程师反馈下设计。61 名独立人类专家的 71 次 8 小时尝试 已开源;agent transcripts 在 transcripts.metr.org。一个 AI agent 写出的自定义 Triton kernel 比任何人类专家都快。2h / 8h / 32h 时间预算协议。是目前最丰富的轨迹数据集之一。作者要求投稿者不把解发布到训练数据中。代码:github.com/METR/RE-Bench

  • PaperBench(OpenAI,ICML 2025)—— 从零复现 20 篇 ICML 2024 Spotlight / Oral 论文:代码库 + 实验 + 结果匹配。作者共同开发的评分量表加 LLM judge(SimpleJudge),8,316 个分别打分的子任务。ML PhD 达 41.4%;Claude 3.5 Sonnet New 达 21.0%。完整 PaperBench 每模型平均约 $4,000 USDPaperBench Code-Dev 变体更便宜且无需 GPU(约 $2,000),但论文指出 Code-Dev 与完整 PaperBench 仅”弱相关”。代码:github.com/openai/preparedness

  • MLAgentBench(Stanford / Snap,arXiv 2310.03302,NeurIPS 2024)—— 13 个 ML 实验任务:CIFAR-10 改进、BabyLM、Kaggle 挑战。ReAct 式脚手架,含文件操作、代码执行、输出检查。最佳 agent(Claude 3 Opus)平均成功率 37.5%。MLE-bench / MLRC-Bench 的基础前身。部分任务(CIFAR-10)现已饱和。代码:github.com/snap-stanford/MLAgentBench

  • MLRC-BencharXiv 2504.09702,2025 年 4 月,NeurIPS 2025)—— 来自近期 ML 会议的 7 个 ML 研究竞赛 任务(LLM 合并、机器遗忘、感知、few-shot)。明确设计为需要超出超参调优的 新方法论创新。最佳 agent(MLAB 脚手架下的 gemini-exp-1206)只关上了基线与人类最佳之间差距的 9.3% ——目前最强的”前沿 LLM 离做方法论研究有多远”的度量。设计为可纳入新比赛的”动态” benchmark。代码:github.com/yunx-z/MLRC-Bench;HF Spaces leaderboard。

  • MLE-DojoarXiv 2505.07782,2025 年 5 月)—— 基于 200+ Kaggle 挑战的 gym 式交互环境,含 150 任务专用训练集。显式支持 SFT 与 RL ——MLE-bench 的可训练对偶。SFT 与 RL 的采样轨迹都提供。与 MLE-bench 的 Kaggle 分布重合严重,团队多挑一个:要可训练 RL 轨迹用 MLE-Dojo,要工业标准 leaderboard 可比性用 MLE-bench。代码:github.com/MLE-Dojo/MLE-Dojo

  • MLGym / MLGym-Bench(Meta + NYU,arXiv 2502.14499,2025 年 2 月)—— gym 式 ML 研究环境,13 个跨 CV / NLP / RL / 博弈论的开放任务。基线来自 Claude 3.5 Sonnet、Llama-3.1-405B、GPT-4o、o1-preview、Gemini-1.5 Pro。CC-BY-NC 4.0 许可——限制商用部署。作者指出当前前沿模型在实践中大多只调超参,而非创意。代码:github.com/facebookresearch/MLGym

  • AI-Scientist v1 与 v2(Sakana,v1 arXiv 2408.06292,2024;v2 arXiv 2504.08066,2025)。v1 在 nanoGPT、2D diffusion、grokking 模板上端到端生成 ML 论文。v2 移除模板,采用渐进 agentic 树搜索;一篇 v2 生成的论文被 ICLR 2025 的 ICBINB workshop 接收(之后按预先约定自愿撤回)。按 ICBINB 组织者的 ICLR 2026 提案,往届约 40 篇投稿中接收 35 篇(约 87.5%),远高于主会接收率——”首篇 AI 生成的同行评审论文”标题需谨慎对待。完整一篇约 $15–25。注意:这是 agent 系统,不是有标准化打分的 benchmark;评估需 LLM-as-judge 或人审。代码:github.com/SakanaAI/AI-Scientist(约 13.8k stars)与 github.com/SakanaAI/AI-Scientist-v2(约 6.4k stars)。

  • DSBench(ICLR 2025,arXiv 2409.07703)—— 来自 ModelOff 与 Kaggle 的 466 个数据分析任务 + 74 个数据建模任务。长上下文、多模态数据、多表推理。最佳 agent 解决 34.12% 的分析任务。更”分析师”而非”科学家”,但有用的相邻参考。代码:github.com/LiqiangJing/DSBench

Scientific Discovery & Reproduction — Deep Dive

The “scientist-like” tier — domain-grounded tasks closer to actual research workflows than to ML engineering.

  • TTT-Discover’s open-sourced environments (arXiv 2601.16175, Jan 2026) — the four domains the rest of this post analyzed are themselves fully open-sourced: Erdős minimum-overlap & autocorrelation inequalities (closed-form continuous reward; AlphaEvolve set 0.380924, TTT-Discover improved to 0.380876); GPUMode TriMul kernel competition (H100/A100 timing as reward; TTT-Discover 1161μs vs human 1371μs on H100); AtCoder Heuristic Contests (overlaps ALE-Bench; 1st place in ahc039); single-cell denoising on OpenProblems (training on Pancreas, evaluation held out on PBMC and Tabula Muris Senis Lung; Poisson / MSE score). Code: github.com/test-time-training/discover with sandbox reward evaluators and ALE-Bench / GPUMode integration.

  • DiscoveryBench (AI2, arXiv 2407.01725, Jul 2024) — 264 real data-driven discovery tasks from 6 domains (sociology, biology, humanities, economics, engineering, meta-science) + 903 synthetic tasks. Each task = dataset + metadata + natural-language discovery goal. Hypotheses are constrained by available columns — “open-ended” but bounded by the schema. 100+ citations by mid-2025; foundational AI2 reference. Code: github.com/allenai/discoverybench; HF dataset.

  • ScienceAgentBench (OSU, arXiv 2410.05080, ICLR 2025) — 102 tasks from 44 peer-reviewed publications in bioinformatics, computational chemistry, GIS, cognitive neuroscience. Agent must produce a self-contained Python program. Best agentic framework solves ~32.4% (rising to 34.3% with expert knowledge). Code: github.com/OSU-NLP-Group/ScienceAgentBench; HAL leaderboard at hal.cs.princeton.edu.

  • BixBench (FutureHouse, arXiv 2503.00096, Mar 2025) — 53 real-world bioinformatics scenarios with 296 guiding research questions, paired with Jupyter notebooks and heterogeneous input data. Agents execute Python/R/Bash in notebooks; open-ended analytical workflows. Adopted by Biomni Lab, Edison Analysis, and multiple agent providers as of 2026. Code: github.com/Future-House/BixBench; HF dataset.

  • LAB-Bench (FutureHouse, 2024) — 2,400+ multiple-choice questions on practical biology research capabilities. MCQ format limits its use as an open-ended research-agent eval, but a reference for “does the model know the standard bio workflows.”

  • CORE-Bench (Princeton, arXiv 2409.11363, TMLR 2025) — 270 tasks from 90 papers across CS / social science / medicine focused on computational reproducibility (rerun existing code/data). 3 difficulty levels; best agent: 21% on hardest level. Narrower than PaperBench but more grounded.

  • SUPER (Allen AI, arXiv 2409.07440, EMNLP 2024) — 45 expert end-to-end + 152 sub-problems + 602 auto-generated problems on setting up and executing research repos. GPT-4o solves 16.3% e2e / 46.1% sub-problems. Gold solutions and baseline trajectories at github.com/allenai/super-benchmark.

  • RExBench (Tin Lab, arXiv 2506.22598, Jun 2025) — 12 realistic research-extension tasks (NLP/ML), each built on existing papers with domain-expert-written extension instructions. Best agent: ~33% without hints, ~44% with hints. Extensions are given, not discovered. Code: github.com/tinlaboratory/RExBench (gated HF dataset).

“类科学家”层——领域扎根的任务,比 ML 工程更接近真实研究工作流。

  • TTT-Discover 开源环境arXiv 2601.16175,2026 年 1 月)—— 本文其余部分分析的四个领域本身都全部开源:Erdős minimum-overlap 与自相关不等式(闭式连续奖励;AlphaEvolve 设 0.380924,TTT-Discover 改进到 0.380876);GPUMode TriMul kernel 竞赛(H100/A100 计时作奖励;TTT-Discover 在 H100 上 1161μs vs 人类 1371μs);AtCoder Heuristic Contests(与 ALE-Bench 重合;ahc039 第一);OpenProblems 单细胞去噪(Pancreas 上训练,PBMC 与 Tabula Muris Senis Lung 留出评估;Poisson / MSE 评分)。代码:github.com/test-time-training/discover,含 sandbox 奖励 evaluator 与 ALE-Bench / GPUMode 集成。

  • DiscoveryBench(AI2,arXiv 2407.01725,2024 年 7 月)—— 跨 6 领域的 264 个真实数据驱动发现任务(社会学、生物学、人文、经济学、工程、元科学)+ 903 个合成任务。每个任务 = 数据集 + 元数据 + 自然语言发现目标。假设受可用列约束——”开放”但被 schema 框住。到 2025 年中累计 100+ 引用;AI2 基础参考。代码:github.com/allenai/discoverybench;HF dataset。

  • ScienceAgentBench(OSU,arXiv 2410.05080,ICLR 2025)—— 来自 44 篇同行评审论文的 102 任务,涵盖生物信息学、计算化学、GIS、认知神经科学。Agent 必须产出独立 Python 程序。最佳 agentic 框架解决 约 32.4%(加专家知识升到 34.3%)。代码:github.com/OSU-NLP-Group/ScienceAgentBench;HAL leaderboard 在 hal.cs.princeton.edu

  • BixBench(FutureHouse,arXiv 2503.00096,2025 年 3 月)—— 53 个真实生物信息学场景 + 296 个指导研究问题,配 Jupyter notebook 与异构输入数据。Agent 在 notebook 中执行 Python/R/Bash;开放式分析工作流。截至 2026 年被 Biomni Lab、Edison Analysis 与多家 agent 提供商采用。代码:github.com/Future-House/BixBench;HF dataset。

  • LAB-Bench(FutureHouse,2024)—— 2,400+ 道多选题,测实用生物学研究能力。MCQ 格式限制了作开放式研究 agent 评估的用途,但可作”模型是否知道标准生物工作流”的参考。

  • CORE-Bench(Princeton,arXiv 2409.11363,TMLR 2025)—— 来自 CS / 社科 / 医学共 90 篇论文的 270 任务,聚焦 计算可复现性(重跑已有代码/数据)。3 个难度级;最佳 agent 在最难级达 21%。比 PaperBench 窄但更扎实。

  • SUPER(Allen AI,arXiv 2409.07440,EMNLP 2024)—— 45 个专家端到端 + 152 个子问题 + 602 个自动生成问题,关于研究 repo 的搭建与运行。GPT-4o 解决 16.3% 端到端 / 46.1% 子问题。金标准解与基线轨迹在 github.com/allenai/super-benchmark

  • RExBench(Tin Lab,arXiv 2506.22598,2025 年 6 月)—— 12 个真实研究扩展任务(NLP/ML),每个基于已有论文,附领域专家写的扩展指令。最佳 agent:无提示约 33%,有提示约 44%。扩展由人给出,并非 agent 自己发现。代码:github.com/tinlaboratory/RExBench(gated HF dataset)。

Code, Kernels, and Software Engineering — Deep Dive

  • KernelBench (Stanford, arXiv 2502.10517, Dec 2024, ICML 2025) — 250 PyTorch ML workload tasks across 4 difficulty levels (single ops, fused ops, full models, HuggingFace model optimization). Metric: fast_p = correct AND >p× faster than PyTorch. Native single-turn but explicitly supports iterative refinement with execution + profiler feedback; DeepSeek R1 with 10 turns reaches 43% / 72% / 18% across Levels 1/2/3. New kernelbench.com v3 (multi-GPU) and “Hard” (Blackwell) variants released Q1/Q2 2026; KernelBench-X extension exists; follow-ups include Meta’s KernelEvolve, AMD’s GEAK Triton agents, Sakana’s reproductions. Best-paper at DL4C @ ICLR 2025; ICML 2025. The field’s central GPU-kernel-LLM benchmark. Code: github.com/ScalingIntelligence/KernelBench.

  • ALE-Bench (Sakana, arXiv 2506.09050, Jun 2025) — past AtCoder Heuristic Contest tasks: score-based NP-hard optimization (routing, scheduling, packing). The only major benchmark explicitly designed for objective-driven, score-based, long-horizon algorithmic engineering (contests run for one to two weeks). Supports interactive agents with test-run feedback and visualizations. ALE-Agent ships as reference baseline. Backbone for TTT-Discover’s algorithm domain. Code: github.com/SakanaAI/ALE-Bench; HF dataset at hf.co/datasets/SakanaAI/ALE-Bench.

  • SWE-bench Verified / Pro / EVO — the repository-bug-fix family. SWE-bench Verified is 500 human-filtered SWE-bench instances; mostly closed-ended (hidden unit tests) with 1–2 line fixes for 161 of 500 problems. SWE-Bench Pro (Sep 2025) is 1,865 problems from 41 enterprise-style repos with long-horizon, multi-file changes (avg 21 files modified). SWE-EVO (Dec 2025) provides 48 release-note-based tasks with avg 874 tests per instance. All three measure software engineering capability but a single canonical patch is hidden behind tests — doesn’t satisfy “open-ended ideation.” Useful adjacent baselines.

  • SWE-Gym (Berkeley, Dec 2024, ICML 2025) — 2,438 real-world Python tasks with executable runtime, unit tests, NL specs. Designed for training, not just evaluating. The entire fine-tuned model + trajectory dataset are open. 32.0% on SWE-Bench Verified with a 32B model. R2E-Gym extension (arXiv 2504.07164) provides 8.1K procedurally-generated environments for even more trajectory data — strong for trajectory pretraining, weak on ideation.

  • GSO (arXiv 2505.23671, May 2025, NeurIPS 2025 D&B) — 102 software optimization tasks mined from real commits in NumPy, Pandas, Pillow, llama.cpp, etc. across 5 languages. Agent gets codebase + perf test, must beat expert dev optimization. “Hack Detector” anti-gaming. Frontier SWE-agents get <5% OPT@1. Agent trajectories explicitly released at gso-bench.github.io. Closest cousin of KernelBench in spirit.

  • Commit0 (Cornell + Cohere, arXiv 2412.01769, Dec 2024) — 54 entire Python libraries built from scratch from API specs + interactive unit tests. Best agent: 29.30% subset / 6.12% full.

  • Aider Polyglot (Dec 2024) — 225 hard Exercism problems in 6 languages. Light multi-step: two attempts with unit-test feedback. Closed-form puzzles; useful for “does the model code?” but not for ideation-bound evaluation.

  • SWE-Lancer (OpenAI, 2025) — freelance software-engineering tasks scored by code review / payout. Less mature trajectory ecosystem than the SWE-bench family.

  • KernelBench(Stanford,arXiv 2502.10517,2024 年 12 月,ICML 2025)—— 跨 4 难度级共 250 个 PyTorch ML workload 任务(单算子、融合算子、完整模型、HuggingFace 模型优化)。度量:fast_p = 正确且 >p× 快于 PyTorch。原生 single-turn 但显式支持带执行 + profiler 反馈的迭代精化;DeepSeek R1 跑 10 回合在 Level 1/2/3 上分别达 43% / 72% / 18%。2026 Q1/Q2 发布了新的 kernelbench.com v3(多 GPU)与”Hard”(Blackwell)变体;存在 KernelBench-X 扩展;后续包括 Meta 的 KernelEvolve、AMD 的 GEAK Triton agents、Sakana 复现。DL4C @ ICLR 2025 最佳论文;ICML 2025。GPU-kernel-LLM 领域的中心 benchmark。代码:github.com/ScalingIntelligence/KernelBench

  • ALE-Bench(Sakana,arXiv 2506.09050,2025 年 6 月)—— 过往 AtCoder Heuristic Contest 任务:基于分数的 NP-hard 优化(路由、调度、装箱)。唯一显式为目标驱动、分数式、长时程 算法工程设计的主要 benchmark(比赛跑一到两周)。支持带运行反馈与可视化的交互 agent。ALE-Agent 作为参照基线随附。TTT-Discover 算法领域的骨干。代码:github.com/SakanaAI/ALE-Bench;HF dataset 在 hf.co/datasets/SakanaAI/ALE-Bench

  • SWE-bench Verified / Pro / EVO —— 仓库 bug 修复家族。SWE-bench Verified 是 500 个人工筛选过的 SWE-bench 实例;大多封闭(隐藏单测),161/500 个问题只需 1–2 行修复SWE-Bench Pro(2025 年 9 月)是 来自 41 个企业风格 repo 的 1,865 个问题,长时程、跨文件改动(平均 21 文件)。SWE-EVO(2025 年 12 月)有 48 个基于 release note 的任务,平均每实例 874 测试。三者都测软件工程能力,但单一规范补丁藏在测试后面——不满足”开放式 ideation”。作为相邻基线有用。

  • SWE-Gym(Berkeley,2024 年 12 月,ICML 2025)—— 2,438 个真实 Python 任务,含可运行运行时、单测、NL 规范。为 训练 而非仅评估设计。完整微调模型 + 轨迹数据集都开源。32B 模型在 SWE-Bench Verified 上 32.0%R2E-Gym 扩展arXiv 2504.07164)提供 8.1K 个程序化生成的环境 以获得更多轨迹数据——适合轨迹预训练,对 ideation 力弱。

  • GSOarXiv 2505.23671,2025 年 5 月,NeurIPS 2025 D&B)—— 来自 NumPy、Pandas、Pillow、llama.cpp 等真实 commit 的 102 个软件优化任务,跨 5 种语言。Agent 拿到代码库 + perf 测试,必须超越专家开发者优化。”Hack Detector” 防作弊。前沿 SWE-agents 在 OPT@1 上 <5%。Agent 轨迹明确发布在 gso-bench.github.io。精神上 KernelBench 的近表亲。

  • Commit0(Cornell + Cohere,arXiv 2412.01769,2024 年 12 月)—— 从 API 规范 + 交互单测从零构建 54 个完整 Python 库。最佳 agent:子集 29.30% / 完整 6.12%

  • Aider Polyglot(2024 年 12 月)—— 6 种语言的 225 道难 Exercism 题。轻量多步:两次尝试加单测反馈。封闭式谜题;可用于”模型会不会写代码”但不适合 ideation 约束的评估。

  • SWE-Lancer(OpenAI,2025)—— 由代码评审 / 报酬打分的自由职业软件工程任务。轨迹生态比 SWE-bench 家族不成熟。

Math, Theorem Proving, and Formal Reasoning — Deep Dive

  • FrontierMath (Epoch AI, 2024) — per Epoch AI’s benchmark page: “The full FrontierMath dataset contains 350 problems. This is split into a base set of 300 problems, which we call Tiers 1–3, and an expansion set of 50 exceptionally difficult problems, which we call Tier 4.” Single-answer (numerical / symbolic match), sub-second eval. Excellent capability signal but doesn’t reward search.

  • PutnamBench (arXiv 2407.11214) — 640 Putnam problems formalized in Lean 4, Isabelle, and Coq (417 in Coq). A proof attempt is multi-step in the prover but eval is binary pass/fail with fast verification.

  • MiniF2F488 olympiad-level problems formalized across Lean / Isabelle / HOL Light / Metamath. The longest-standing formal math benchmark.

  • LeanDojo / Lean Workbook — end-to-end Lean 4 environment with retrieval; Lean Workbook has ~57K formal-informal problem pairs. Better positioned than PutnamBench for open-ended search because the environment is multi-step at tactic level, though eval is still cheap.

  • Erdős-style open math problems (via TTT-Discover) — the autocorrelation inequality and minimum-overlap problems used by TTT-Discover are the only “open-ended math” tasks that genuinely satisfy continuous reward + ideation matters + multi-step search + recognized math identity. As a benchmark suite they exist only as TTT-Discover’s sandbox.

  • FrontierMath(Epoch AI,2024)—— 按 Epoch AI benchmark 页:”完整 FrontierMath 数据集含 350 个问题。拆为 300 题的基础集(Tier 1–3)和 50 题的扩展集(极难,Tier 4)。”单答案(数值 / 符号匹配),亚秒级评估。能力信号优秀但不奖励搜索。

  • PutnamBencharXiv 2407.11214)—— 640 道 Putnam 题在 Lean 4、Isabelle、Coq 中形式化(Coq 中 417 道)。证明尝试在证明器中是多步,但评估是二值 pass/fail,验证快。

  • MiniF2F —— 跨 Lean / Isabelle / HOL Light / Metamath 形式化的 488 道奥数级题。最久的形式数学 benchmark。

  • LeanDojo / Lean Workbook —— 带检索的端到端 Lean 4 环境;Lean Workbook 含 约 57K 形式-非形式题对。因环境在战术级是多步,比 PutnamBench 更适合开放搜索,但评估仍便宜。

  • Erdős 风格的开放数学问题(经 TTT-Discover)—— TTT-Discover 用的 自相关不等式minimum-overlap 问题 是唯一真正满足连续奖励 + ideation 重要 + 多步搜索 + 数学界认可身份的”开放式数学”任务。作为 benchmark 套件,目前只存在于 TTT-Discover 的 sandbox。

Adjacent Ecosystems: Paper Writing, Web Agents, Games, Evolutionary Reproductions

Paper writing. ResearchBench (Shanghai AI Lab, Mar 2025) covers inspiration retrieval + hypothesis composition + hypothesis ranking over 1,386 papers in 12 disciplines — but it’s a static text benchmark, no multi-step environment. ReportBench (ByteDance) is survey-writing eval via citation overlap + statement validation — same disqualifier. The community-run ML Reproducibility Challenge is annual but with informal eval. AI-Scientist v2 generates papers but evaluation falls back to LLM-as-judge or workshop submission.

Web agents. WebArena, VisualWebArena, WorkArena, Mind2Web provide multi-step browser environments with binary task-completion rewards — useful adjacent baselines if web research is in scope. BrowseComp (OpenAI, Apr 2025) is 1,266 hard short-answer web research questions — single-answer format, sparse reward; OpenAI Deep Research already saturates the easy portion. BrowseComp-Plus (ACL 2026) adds a fixed corpus with human-verified supporting docs; better-controlled but still single-answer.

Game-based RL environments. NetHack via NLE, Crafter, MineDojo, Voyager; BALROG (ICLR 2025) aggregates several. Craftax (JAX, multi-orders-of-magnitude speedup) supports fast iteration. These satisfy “open-ended” and “long-horizon” but the broader ML community frames them as embodied / long-horizon planning testbeds, not “AI scientist” benchmarks. Including them in a research-agent portfolio dilutes the AI-for-science narrative.

Robotics curricula. RoboHive, RLBench, LIBERO are well-developed for robot learning but not “AI scientist” benchmarks.

Evolutionary-search reproductions. AlphaEvolve is closed source. OpenEvolve and CodeEvolve reproduce its template — use these if you want AlphaEvolve-style environments. AlphaEvolve’s reported numbers — per Google Cloud’s “AlphaEvolve on Google Cloud” blog, “AlphaEvolve sped up a vital kernel in Gemini’s architecture by 23%, leading to a 1% reduction in Gemini’s training time” — are DeepMind-reported and not independently verified. FunSearch domains (cap-set, online bin packing on OR-Library OR1–OR4 + Weibull 5K, cyclic graphs) are public at github.com/google-deepmind/funsearch but the repository ships code, not a bespoke benchmark suite.

Optimization-flavored. OPT-BENCH (arXiv 2506.10764, Jun 2025) is 20 Kaggle ML tasks + 10 NP-hard combinatorial problems with multi-step interaction — moderate fit. OptiBench is single-shot LP/MILP modeling.

Hardware / systems optimization beyond kernels. No widely-accepted benchmark; KernelBench and GSO are the closest the community currently has.

Materials, chemistry, drug discovery. No public benchmark currently matches all four criteria. ChemBench exists as a knowledge / MCQ eval. The most credible attempts at agentic discovery in these areas remain proprietary (Anthropic’s microbiology results, Isomorphic Labs).

论文写作。 ResearchBench(上海 AI Lab,2025 年 3 月)覆盖跨 12 学科 1,386 篇论文 的灵感检索 + 假设组合 + 假设排序——但是静态文本 benchmark,无多步环境。ReportBench(字节)是经引用重合 + 命题校验的综述写作评估——同样不合格。社区运行的 ML Reproducibility Challenge 每年举办,但评估非正式。AI-Scientist v2 生成论文,但评估退化为 LLM-as-judge 或 workshop 投稿。

Web agent。 WebArenaVisualWebArenaWorkArenaMind2Web 提供带二值任务完成奖励的多步浏览器环境——若 web 研究在范围内是有用的相邻基线。BrowseComp(OpenAI,2025 年 4 月)是 1,266 道难的短答案 web 研究题 ——单答案格式,奖励稀疏;OpenAI Deep Research 已饱和容易部分。BrowseComp-Plus(ACL 2026)加入固定语料库与人工核实的支持文档;控制更好但仍单答案。

基于游戏的 RL 环境。 NetHack via NLECrafterMineDojoVoyagerBALROG(ICLR 2025)聚合多个。Craftax(JAX,多个数量级加速)支持快速迭代。这些满足”开放”与”长时程”,但更广的 ML 社区把它们视为 具身 / 长时程规划测试台,而非”AI scientist” benchmark。放进研究 agent 组合会稀释 AI-for-science 叙事。

机器人课程。 RoboHive、RLBench、LIBERO 在机器人学习上很成熟,但不是”AI scientist” benchmark。

演化搜索复现。 AlphaEvolve 闭源。OpenEvolveCodeEvolve 复现了它的模板——想要 AlphaEvolve 风格环境就用它们。AlphaEvolve 的公开数字——按 Google Cloud 的”AlphaEvolve on Google Cloud”博客:”AlphaEvolve 把 Gemini 架构里一个关键 kernel 加速了 23%,让 Gemini 训练时间减少 1%“——是 DeepMind 自报,未经独立验证。FunSearch 领域(cap-set、OR-Library OR1–OR4 + Weibull 5K 上的在线 bin packing、循环图)在 github.com/google-deepmind/funsearch 公开,但仓库放的是代码而非专门 benchmark 套件。

优化味道。 OPT-BENCHarXiv 2506.10764,2025 年 6 月)是 20 个 Kaggle ML 任务 + 10 个 NP-hard 组合问题,多步交互——中等契合。OptiBench 是单回合 LP/MILP 建模。

Kernel 之外的硬件 / 系统优化。 没有广泛认可的 benchmark;KernelBench 与 GSO 是社区目前最接近的。

材料、化学、药物发现。 当前没有公开 benchmark 同时满足四标准。ChemBench 作为知识 / MCQ 评估存在。这些领域 agentic 发现最可信的尝试仍闭源(Anthropic 微生物学结果、Isomorphic Labs)。

TTT-Discover's Four Domains, Mapped

To reproduce TTT-Discover or compare against it, each of its four domains has a corresponding community-recognized benchmark:

TTT-Discover domain Public benchmark Notes
Erdős minimum overlap / autocorrelation TTT-Discover’s open sandbox at github.com/test-time-training/discover; AlphaEvolve-style reproductions in OpenEvolve and CodeEvolve No standalone community benchmark; the problems themselves are recognized math
GPUMode TriMul kernel KernelBench (250 PyTorch tasks across 4 difficulty levels, fast_p metric) TTT-Discover used GPUMode directly; KernelBench is the community-standard generalization
AtCoder Heuristic Contests ALE-Bench (past AHC tasks, score-based continuous reward, ALE-Agent reference baseline) TTT-Discover used AHC directly; ALE-Bench is the now-standard wrapper
Single-cell RNA denoising OpenProblems (single-cell benchmarking framework); BixBench (53 bioinformatics scenarios at notebook-level) OpenProblems is the underlying dataset; BixBench is the agent-task wrapper

The pattern is consistent: TTT-Discover picked the underlying problem source for each domain (GPUMode, AtCoder, OpenProblems) but did not adopt the benchmark wrappers the community has since standardized on (KernelBench, ALE-Bench, BixBench). A successor system can plug into the wrappers directly.

Two adjacent ecosystems share the same open-ended-plus-multi-step-plus-executable shape but extend in different directions:

  • Data-driven scientific discovery. DiscoveryBench (AI2; 264 real data-driven discovery tasks across sociology, biology, humanities, economics, engineering, meta-science), ScienceAgentBench (OSU; 102 tasks from 44 peer-reviewed publications in bioinformatics, comp. chemistry, GIS, cognitive neuroscience), and BixBench push toward analytical workflows rather than just engineering.
  • Software optimization. GSO (102 software optimization tasks mined from real commits in NumPy, Pandas, Pillow, llama.cpp across 5 languages) is the closest cousin of KernelBench in spirit. The SWE-bench / SWE-Gym / R2E-Gym family handles repository-level software engineering, though most instances of SWE-bench Verified are closed-ended bug fixes.

要复现或对照 TTT-Discover,它每个领域都有对应的社区认可 benchmark:

TTT-Discover 领域 公开 benchmark 备注
Erdős minimum overlap / 自相关 TTT-Discover 开源 sandbox:github.com/test-time-training/discover;AlphaEvolve 风格复现 OpenEvolveCodeEvolve 还没独立社区 benchmark;问题本身是已知数学
GPUMode TriMul kernel KernelBench(4 难度级共 250 PyTorch 任务,fast_p 度量) TTT-Discover 直接用 GPUMode;KernelBench 是社区标准化的泛化
AtCoder Heuristic Contests ALE-Bench(过往 AHC 任务,连续分数奖励,含 ALE-Agent 参照基线) TTT-Discover 直接用 AHC;ALE-Bench 是现已标准化的包装
单细胞 RNA 去噪 OpenProblems(单细胞 benchmark 框架);BixBench(notebook 级的 53 个生物信息场景) OpenProblems 是底层数据集;BixBench 是 agent 任务包装

模式一致:TTT-Discover 为每个领域选了 底层问题源(GPUMode、AtCoder、OpenProblems),但没采用社区随后标准化的 benchmark 包装(KernelBench、ALE-Bench、BixBench)。继任系统可以直接接到包装上。

两个共享”开放 + 多步 + 可执行”形态但向不同方向延伸的相邻生态:

  • 数据驱动的科学发现。 DiscoveryBench(AI2,跨社会学、生物学、人文、经济学、工程、元科学的 264 个真实数据驱动发现任务)、ScienceAgentBench(OSU,来自 44 篇生物信息学、计算化学、GIS、认知神经科学同行评审论文的 102 任务)、BixBench,往分析工作流方向推进而非仅工程。
  • 软件优化。 GSO(从 NumPy、Pandas、Pillow、llama.cpp 等真实 commit 中跨 5 种语言挖出的 102 个软件优化任务),精神上 KernelBench 的近表亲。SWE-bench / SWE-Gym / R2E-Gym 家族处理仓库级软件工程,但 SWE-bench Verified 大多实例是封闭式 bug 修复。

What's Still Missing

Three notable gaps as of mid-2026:

  • No materials / chemistry / drug-discovery benchmark satisfies all four criteria. ChemBench is multiple-choice. The most serious “agentic discovery” efforts in these areas are either proprietary (Anthropic’s microbiology work, Isomorphic Labs) or require wet-lab experiments — currently out of reach for pure-software agents.
  • No paper-writing benchmark with closed-loop scoring. ResearchBench (1,386 papers across 12 disciplines) and ReportBench measure surface properties — citation overlap, hypothesis ranking, statement validation — but don’t reward end-to-end execution. AI-Scientist v2 generates papers but evaluation falls back to LLM-as-judge or workshop submission, both noisy and inflated by workshop acceptance rates.
  • Game-based environments — NetHack, Crafter, MineDojo, Voyager, BALROG — are open-ended and long-horizon but the community does not treat them as “AI scientist” benchmarks. They sit cleanly on the “embodied / long-horizon planning testbed” branch instead.

The takeaway: TTT-Discover’s four-task hand-pick was effective at proving the method, but it’s one slice of a much wider landscape. The same RL-at-inference loop should plug, with minimal changes, into MLE-bench, RE-Bench, KernelBench, ALE-Bench, MLRC-Bench, and PaperBench Code-Dev. The next paper in this line will likely report on exactly that — and the comparison against frontier models will move with it.

到 2026 年中,明显的三个缺口:

  • 没有满足四标准的材料 / 化学 / 药物发现 benchmark。 ChemBench 是 MCQ。这些领域最严肃的”agentic 发现”或闭源(Anthropic 微生物学、Isomorphic Labs)或需要湿实验——对纯软件 agent 暂时不可及。
  • 没有闭环打分的论文写作 benchmark。 ResearchBench(跨 12 学科的 1,386 篇论文)与 ReportBench 衡量表面属性——引用重合度、假设排序、命题校验——但不奖励端到端执行。AI-Scientist v2 生成论文,但评估退化为 LLM-as-judge 或 workshop 投稿,两者都嘈杂且被 workshop 接收率高估。
  • 游戏环境——NetHack、Crafter、MineDojo、Voyager、BALROG——开放且长时程,但社区不把它们视为”AI scientist” benchmark。 它们更适合归到”具身 / 长时程规划测试台”那支。

要点:TTT-Discover 的四任务精选有效证明了方法,但只是更广全景的一片。同一份”推理时 RL”循环应该能以最小改动接到 MLE-bench、RE-Bench、KernelBench、ALE-Bench、MLRC-Bench、PaperBench Code-Dev 上。这条线的下一篇论文很可能就在此展开——和前沿模型的对比也会随之转移。