AI for Scientific Discovery: Three Milestones and a Benchmark Map
A Warm-Up: The Cap Set Problem
热身:Cap Set 问题
Before we look at any method, let’s spend a few minutes on one concrete problem all three systems will try to solve — the cap set problem. We’ll work up to the formal statement slowly, starting from a card game you may have played. If you already know the problem from additive combinatorics, you can skim this section.
Start With the Card Game SET
从 SET 卡牌游戏说起
You may know the card game SET. Each card has four attributes — color, shape, number, shading — and each attribute takes one of three values (red/green/purple, oval/squiggle/diamond, 1/2/3, solid/striped/open). So the full deck has \(3 \times 3 \times 3 \times 3 = 81\) cards. A “SET” is three cards such that, for each of the four attributes, the three cards are either all the same or all different. Players race to spot SETs on a shared table.
Hidden inside the game is a natural puzzle:
The no-SET puzzle. Lay out a bunch of cards on the table. What’s the largest number of cards you can put down with no three of them forming a SET?
The answer is 20. You can lay 20 cards from the 81-card deck without any three forming a SET; you cannot lay 21. The constructions that achieve 20 are not obvious — they have specific structure. Our goal in the rest of this warm-up is to turn this innocent-looking card-game question into a mathematical problem you can actually attack.
Encoding the SET Rule as Math
把 SET 规则编码为数学
To reason about the no-SET puzzle we leave the cards behind and work with numbers. Label each attribute value as 0, 1, or 2. (For color, say red = 0, green = 1, purple = 2. The choice is arbitrary; what matters is that each attribute has three labels.) A card now becomes a 4-tuple of digits, e.g., \((0, 2, 1, 1)\) = “red, diamond, 2 shapes, striped”. The 81-card deck is exactly all possible such tuples. Mathematicians write this collection as \(\mathbb{F}_3^4\) — but you can read the symbol as nothing more than “4-tuples whose entries are 0, 1, or 2.”
Now look at when three cards form a SET. Pick any one attribute (one coordinate) and consider the three cards’ values there. By the SET rule, the three values are either all the same or all different. Sum the three values in each case:
- All same, e.g. \(1 + 1 + 1 = 3\). Divisible by 3.
- All different — three values \(0, 1, 2\) in some order — \(0 + 1 + 2 = 3\). Divisible by 3.
- Two same one different (a non-SET case), e.g. \(0 + 0 + 1 = 1\). Not divisible by 3.
So the SET rule, attribute by attribute, is exactly the rule “three values sum to a multiple of 3.” Stacking the four attributes back together:
Three cards \(x, y, z\) form a SET \(\iff x + y + z = (0,0,0,0)\) in \(\mathbb{F}_3^4\), where addition is taken coordinate-wise mod 3.
This is the small but important translation step. The puzzle “what’s the largest no-SET collection of cards?” has become a clean mathematical question:
What’s the largest subset of \(\mathbb{F}_3^4\) in which no three distinct elements sum to zero (mod 3)?
Cap Sets in F3n, and Why They're Hard
F3n 中的 Cap Set,以及它为什么难
There’s nothing magic about four attributes. Replace 4 with any positive integer \(n\) and you get a whole family of problems — the cap set problem:
Given \(\mathbb{F}_3^n\) (the set of \(n\)-tuples with entries in \(\{0, 1, 2\}\)), find the largest subset such that no three distinct elements sum to zero coordinate-wise mod 3. A subset with this property is called a cap set in \(\mathbb{F}_3^n\).
Small cases can be computed (or, for \(n \leq 6\), even proved exactly):
| \(n\) | 1 | 2 | 3 | 4 (SET deck) | 5 | 6 |
|---|---|---|---|---|---|---|
| Max cap set size | 2 | 4 | 9 | 20 | 45 | 112 |
For \(n \geq 7\) the exact answer is unknown; only upper and lower bounds exist. For \(n = 8\) the largest known explicit construction sits at 496 — and improving that by even one element is genuine research. FunSearch produced a construction of size 512 for \(n = 8\), which is exactly such a research-level advance.
Why is the problem so hard? The space \(\mathbb{F}_3^n\) has \(3^n\) elements, so it has \(2^{3^n}\) subsets. For \(n = 8\) that’s \(2^{6561}\), a number with roughly 1975 decimal digits. Brute-force search over subsets is hopeless. You need a construction — a clever recipe that builds a large subset while systematically avoiding any three elements that sum to zero. Such constructions are rare and typically come from deep structural insights: group theory, error-correcting codes, additive number theory.
The slow progress is telling. Between 2002 and 2023, the largest known cap set in \(\mathbb{F}_3^8\) was improved by humans only a handful of times. FunSearch’s 2023 contribution is in this slow line.
To get a visceral sense for how the constraint bites, try the simplest non-trivial case yourself. The widget below is a 3×3 grid of \(\mathbb{F}_3^2\). Click points to add them to your cap set; any three selected points whose coordinates sum to \((0, 0)\) mod 3 will be flagged with a red dashed triangle. The maximum cap set in \(\mathbb{F}_3^2\) has size 4 — try to find one. The surprise is usually how few points you need before some “wrap-around” triple snaps shut.
Why This Is the Right Yardstick
为什么它是合适的标尺
The cap set problem has three properties that make it nearly ideal as a benchmark for any LLM-driven discovery system:
- Mechanically verifiable. A few lines of Python check whether a given subset is a cap set: iterate over triples, check if any sums to zero mod 3. There is no judgment call, no need for a human or another LLM to grade the output.
- No closed form. There is no formula to derive the answer from. Every improvement is genuinely new mathematics, not a rederivation.
- Connected to deep mathematics. Cap-set sizes control the asymptotic exponent \(\omega\) of matrix multiplication via the Cohn–Umans framework. The problem is also the simplest non-trivial instance of “find dense subsets that contain no three-term arithmetic progressions” — a foundational question in additive combinatorics related to Roth’s theorem.
These three properties — easy to verify, no closed form, connected to deeper structure — are exactly what we’ll want from every benchmark in this post. FunSearch picked the cap set problem as one of its two flagship demonstrations because of them. We’ll see analogues in AlphaEvolve (the Erdős minimum-overlap problem) and TTT-Discover (the same Erdős problem, plus the autocorrelation inequality, GPU kernels, AtCoder, biology). The shape of the problem matters as much as the method.
With this concrete problem in mind, we can now look at how FunSearch, AlphaEvolve, and TTT-Discover try to solve it.
FunSearch (2023): Searching the Space of Programs
FunSearch (2023):在程序空间中搜索
The path from “LLMs as conversational assistants” to “LLMs as discovery engines” was opened by FunSearch (Romera-Paredes et al., Nature, Dec 2023). It established the template every subsequent system — including TTT-Discover — has built on: a frozen pretrained LLM proposes candidate programs, a deterministic evaluator scores them, an evolutionary outer loop selects survivors and stuffs them back into the next prompt. Understanding what FunSearch did, and where its design choices start to bind, is the cleanest entry point to the rest of this post.
The Idea: Search Over Programs, Not Outputs
核心思想:在程序而非输出上搜索
The non-obvious move in FunSearch is that the LLM is asked to produce programs that solve the problem, not answers to the problem. For a problem like “find the largest cap set in \(\mathbb{F}_3^n\),” there are at least three plausible search targets:
- The answer directly — a literal set of vectors. Hard to mutate locally, hard to characterize diversity over, and the search space scales with the answer size.
- A solver — a program that, given \(n\), returns a cap set. Compact, interpretable, runs in milliseconds, gives a clean scalar score.
- An informal proof or argument — what GPT-style chain-of-thought defaults to. Hard to verify automatically; needs an LLM-as-judge that itself becomes a noise source.
Why is a program verifiable when chain-of-thought isn’t? Both produce a final answer. The asymmetry is in how you check the answer, not in whether one exists. FunSearch’s deliverable is code: the evaluator runs it and the constructed object (a size-512 cap set, a packing of \(10^6\) items) is generated at evaluation time, then checked mechanically — a single
assertconfirms “is this a cap set?”. Chain-of-thought’s deliverable is prose, which leaves two bad options: either embed the answer literally in the text (a 512-vector list rarely survives an LLM completion intact, and parsing it back out lands you in option 1 with all its scale problems), or argue informally toward it (now verification needs to follow reasoning, which a deterministic checker cannot do — you fall back to another LLM-as-judge or a strict proof checker like Lean, the narrow path AlphaProof takes). Running code is one line of evaluator; reading prose needs parsing, trust, or both.
FunSearch picks (2). Three properties follow:
- Compactness. A 20-line Python function can generate arbitrarily large outputs (e.g., a cap set for any \(n\)). Direct-output search over strings of size proportional to the answer cannot.
- Verifiability. Run the function and compute the score. No subjective judgment, no LLM-as-judge loop, no proof checker required.
- Interpretability. A mathematician can read the resulting Python and understand what construction was discovered — sometimes well enough to generalize it into a paper-quality proof.
The “Fun” in FunSearch is function: searching in the space of functions. That naming choice telegraphs the central design commitment.
Architecture: Frozen LLM + Evolutionary Loop + Sandboxed Evaluator
架构:冻结 LLM + 演化循环 + 沙盒 evaluator
FunSearch’s loop has four components:
- Pretrained LLM. DeepMind used Codey, a PaLM 2 variant fine-tuned for code, frozen throughout. Its only role is to take a prompt of existing programs and write a new one.
- Programs database. An island-based evolutionary population. Each “island” maintains its own pool of high-scoring programs; periodic inter-island migration prevents premature convergence.
- Prompt builder. Selects \(k\) programs from the database (biased toward high scorers), formats them as in-context examples, and asks the LLM to write a better one.
- Sandboxed evaluator. Runs the candidate on the target instance, returns a scalar score. Evaluation is fast (milliseconds) and deterministic.
Crucial constraint: FunSearch evolves one function of at most around 20 lines — the “priority function” or “heuristic” that embodies the central decision the search is optimizing. Everything around it (main loop, I/O, scoring) is fixed by the human designer. This narrow target keeps per-evaluation cost tiny and lets the loop process millions of candidates in one experiment. AlphaEvolve and TTT-Discover both relax this constraint (entire files vs. single functions), and the cost structure changes accordingly.
The frozenness of the LLM is the other binding constraint, and one we will keep coming back to. FunSearch’s pretrained Codey weights are identical at iteration 1 and iteration \(10^6\). Whatever skill the system shows is amortized over all the problems Codey was ever pretrained on, plus the in-context examples assembled at runtime.
Headline Results: Cap Sets and Online Bin Packing
标志性结果:Cap Set 与在线装箱
What's online bin packing? (Click to expand)
Online bin packing: items of various sizes arrive one at a time and must be placed immediately, without knowledge of future arrivals, into bins of fixed capacity. Goal: minimize the total number of bins used. The problem is a classical operations-research staple, modeling settings from VM placement in data centers to truck loading.
Standard heuristics include First-Fit (place in the first bin with room), Best-Fit (place in the tightest bin with room), and First-Fit-Decreasing (offline; sort first). Their competitive ratios are well-studied; improving on them under realistic distributions is non-trivial because the "obvious" tweaks tend to break worst-case guarantees.
FunSearch evolved a small priority function that beats First-Fit and Best-Fit by meaningful margins on standard benchmark distributions, and is readable enough that the underlying "waste-aware" decision rule can be explained in a single sentence.
Two flagship results.
Cap set. For multiple values of \(n\), FunSearch produced larger explicit cap sets in \(\mathbb{F}_3^n\) than any previously known construction. For \(n = 8\) it produced a cap set of size 512. This is a real advance: cap-set lower bounds had been a slow-moving target in additive combinatorics, with progress measured in rare exact constructions, and FunSearch contributed one. The discovered Python function was short enough that mathematicians could read it, simplify it, and study what symmetry it was exploiting.
Online bin packing. FunSearch evolved a priority function that outperforms First-Fit and Best-Fit on multiple benchmark distributions, with a decision rule readable enough to deploy. This was the first demonstration that an LLM-driven search could discover practical algorithmic heuristics whose performance is verified by running on real workloads — not merely solve an artificial puzzle.
The two results matter for different reasons. Cap sets show the system can advance a research-level open problem. Bin packing shows it can produce practical artifacts. Both are made possible by a frozen LLM — there is no learning happening, only sampling from a fixed prior plus selection. The size of that prior and the cleverness of the prompt builder do the work. This is the design point AlphaEvolve and then TTT-Discover would push against in opposite directions.
Why FunSearch Works: Verify Is Easy, Construct Is Hard
FunSearch 为什么能成功:验证容易,构造极难
Two complementary properties make cap set the right problem for an LLM-evolutionary loop. Together they let a noisy proposer beat decades of careful human work.
Verification is cheap and exact. Given any subset \(S \subseteq \mathbb{F}_3^n\), checking whether \(S\) is a cap set takes a few lines of Python:
import itertools
def is_cap_set(S, n):
"""Return True iff no 3 elements of S sum to (0,...,0) mod 3."""
for a, b, c in itertools.combinations(S, 3):
if all((a[k] + b[k] + c[k]) % 3 == 0 for k in range(n)):
return False
return True
That’s it. No subjective judgment, no LLM-as-judge, no learned verifier. The check runs in \(O(\vert S\vert^3 \cdot n)\) time — polynomial in the size of the answer, regardless of how clever the construction was. The evaluator runs this in milliseconds on a candidate and emits a clean numerical score (typically: \(\vert S \vert\) if valid, else 0).
Construction is exponentially hard. \(\mathbb{F}_3^n\) has \(3^n\) elements and \(2^{3^n}\) subsets. For \(n = 8\) that is \(2^{6561}\) — enumeration is hopeless. Worse, naive greedy construction (“add the next compatible element”) plateaus quickly: the largest cap sets require non-local choices, where committing to a slightly worse early element opens up a much larger valid extension later. Humans have produced cap-set lower bounds slowly over decades by inventing constructions from algebraic geometry, group-theoretic symmetries, and error-correcting codes — each requiring years of domain expertise to find a single new construction worth publishing.
This is the canonical NP-style asymmetry: a valid construction is poly-time checkable, but finding one (let alone an optimal one) is exponential in the worst case. Most interesting scientific-discovery problems share this shape — theorems are easy to verify in Lean, programs are easy to run, kernels are easy to benchmark, but inventing the artifact is hard.
What FunSearch outsources. Crucially, FunSearch does not enumerate cap sets or try to “understand” them mathematically. It evolves a small priority(el) function that scores individual elements of \(\mathbb{F}_3^n\). A fixed greedy driver then builds the cap set by taking elements in priority order, rejecting any that would form a line with two already-chosen elements:
import itertools
# === The ~20-line function FunSearch EVOLVES. ===
# The published priority function for n=8 sums several modular-arithmetic
# terms; the form below sketches its structure. See Romera-Paredes et al.
# 2023, Extended Data Fig. 3, for the exact code.
def priority(el: tuple[int, ...], n: int) -> float:
score = 0.0
for i in range(n):
score += LINEAR_COEFF[i][el[i]] # per-coordinate
for i in range(n):
for j in range(i + 1, n):
score += PAIR_COEFF[i][j][el[i]][el[j]] # pairwise
# ... a few more terms; the coefficients embed the discovered structure
return score
# === The FIXED driver. FunSearch does NOT evolve this. ===
def build_cap_set(n: int) -> list[tuple[int, ...]]:
elements = list(itertools.product(range(3), repeat=n))
elements.sort(key=lambda e: -priority(e, n)) # highest first
chosen: list[tuple[int, ...]] = []
for e in elements:
# Reject if e completes a line with any pair already chosen.
forms_line = any(
all((a[k] + b[k] + e[k]) % 3 == 0 for k in range(n))
for a in chosen for b in chosen if a < b
)
if not forms_line:
chosen.append(e)
return chosen
All the intellectual content lives in priority. The LLM proposes candidate priority functions; the evaluator runs build_cap_set and reports len(chosen); the evolutionary loop keeps the high-scoring ones and asks the LLM for variations. Over millions of evaluations, the search converges on priority functions whose induced greedy ordering produces unusually large cap sets — for \(n=8\), FunSearch’s discovered priority function produces a cap set of size 512.
Why this division of labor matters. Humans hit a cognitive wall when asked “how should I score an element to bias the greedy toward large constructions?” The question has millions of plausible answers, no closed form for the best one, and no obvious symmetry to exploit. A mathematician trying to write the priority function from scratch would think in terms of group orbits, linear codes, or quadratic forms — but the best priority function FunSearch finds may not match any of those: it’s a list of coefficients tuned through millions of greedy runs, each scored objectively. The LLM does not think mathematically; it samples short Python functions, each one a guess. Most are bad. A few are surprisingly good. The evaluator separates them cheaply.
The core asymmetry. Cheap to test, intractable to design. This is what lets a noisy LLM proposer plus a deterministic verifier discover what no human could write directly. AlphaEvolve and TTT-Discover both inherit this same asymmetry — on bigger problems (full code files, GPU kernels, biology pipelines) and with bigger budgets, but the underlying logic is identical.
AlphaEvolve (2025): Scaling the Template
AlphaEvolve (2025):把模板扩展上去
AlphaEvolve (DeepMind, 2025) inherits FunSearch’s shape and scales every axis of it: bigger models, longer programs, richer evaluator feedback, and the operational discipline to run all of this against problems that are economically meaningful at Google’s scale. The combination unlocks results that FunSearch’s constraints made out of reach.
What Changed: Functions to Files, One Model to Two
变了什么:函数到文件、单模型到双模型
Three architectural choices distinguish AlphaEvolve from FunSearch:
Two-tier LLM ensemble. A single run uses both Gemini 2.0 Flash (fast, cheap, generates the bulk of proposals) and Gemini 2.5 Pro (slower, smarter, used occasionally to seed harder candidates). Pure 2.5 Pro would be too expensive to run at evolutionary throughput; pure 2.0 Flash misses the deeper moves. The ensemble gets most of the throughput at Flash cost while keeping Pro’s creativity available where it counts.
Evolution at code-file granularity. Where FunSearch evolved single ≤20-line Python functions, AlphaEvolve evolves entire code files of hundreds of lines in any language. Candidates can carry non-trivial state, helper functions, even small frameworks. The trade-off is that each evaluation may take hours rather than milliseconds — but for high-value problems (kernel speedups, scheduler heuristics), that’s still worth it.
Parallel evaluators with rich feedback. Each candidate is dispatched to a parallel evaluator pool. The evaluator returns not just a scalar score but per-objective signals the next prompt can reference. The outer loop is still island-based evolution inherited from FunSearch, but populations and migration patterns are scaled up.
The cost asymmetry vs. FunSearch is the obvious downside: AlphaEvolve’s per-evaluation cost is orders of magnitude higher, so total candidates explored is orders of magnitude lower. The bet is that the richer search space (full files) and richer feedback (per-objective signals) more than compensate.
Benchmark Results: Math and Infrastructure
基准结果:数学与基础设施
What does this machinery actually produce? The headline results across math and infrastructure:
| Domain | Problem | Prior best | AlphaEvolve | Direction |
|---|---|---|---|---|
| Algorithm discovery | 4×4 complex matrix mult | 49 mults (Strassen 1969) | 48 mults | Fewer is better |
| Extremal combinatorics | Erdős minimum-overlap | 0.380927 (2016) | 0.380924 | Lower is better |
| Discrete geometry | Kissing number / packing | Various human records | Improved in several settings | Better bound |
| ML systems | Gemini training kernel | Hand-tuned baseline | ~23% speedup | Faster is better |
| Cluster scheduling | Borg scheduling heuristic | Existing production | ~0.7% compute recovered | Higher recovery |
| TPU hardware | Verilog circuit | Existing design | Reduced gate count | Fewer gates better |
What's Strassen's algorithm, and why is 48 mults a big deal? (Click to expand)
Volker Strassen showed in 1969 that the product of two 2×2 matrices can be computed with just 7 scalar multiplications (and more additions), not the obvious 8. Recursing this insight via block decomposition on n×n matrices gives runtime O(nlog2 7) ≈ O(n2.807) — the first algorithm beating the cubic O(n3) baseline. Applied once more to 4×4 = (2×2)×(2×2) the recursion gives 7×7 = 49 scalar multiplications.
That 49-mult count for 4×4 over the complex numbers stood unchallenged for 56 years, until AlphaEvolve found 48 in 2025. Earlier in 2022, AlphaTensor had already found a 47-mult algorithm for the 4×4 case — but only over the field of two elements ℤ2, not over the reals or complex numbers, so it does not apply to standard linear algebra.
The math results matter as proofs of capability: extremal problems with crisp, verifiable answers, on which AlphaEvolve unambiguously beat the human-mathematician state of the art (e.g., a 56-year-old Strassen record). The production results matter as proofs of value: deployed at Google scale, returning real compute and silicon savings. Both create non-trivial benchmarks for any successor system to target — including TTT-Discover, which directly reuses the Erdős minimum-overlap problem to claim a head-to-head win.
The Frozen-Weight Ceiling
冻结权重的天花板
Through AlphaEvolve’s entire pipeline, the LLM itself does not learn. Its weights at problem 1000 are bitwise identical to its weights at problem 1, just as FunSearch’s Codey weights were. All problem-specific behavior comes from in-context conditioning: the prompt for proposal \(N+1\) contains carefully-curated examples from \(N\). This is enormously powerful — frontier LLMs have huge implicit priors that the evolutionary loop can amplify — but it is also a ceiling.
Once the context budget is exhausted, the model cannot get better at this problem; the only remaining knob is running the loop longer and hoping a sample lands. Empirically, AlphaEvolve runs show diminishing returns after some number of generations on hard problems: the best score stops improving and the wall-clock cost per marginal advance balloons. The remedy that AlphaEvolve does not attempt is to update the model. TTT-Discover’s bet is precisely that moving the LLM’s weights themselves — modestly, RL-style, on the test problem — clears that ceiling at acceptable cost.
A Second Warm-Up: GPU Kernel Design
第二次热身:GPU Kernel 设计
TTT-Discover’s most operationally impressive results are on GPU kernels — beating hand-tuned human champions on four different chips. To appreciate why that’s a hard problem (and what TTT-Discover’s RL loop is actually searching over), we need a short tour of how GPU kernels become fast. The same warm-up-then-method pattern as the cap-set section: this part introduces the problem; the TTT-Discover section that follows shows the result.
What Is a GPU Kernel, and Why Is Making It Fast Hard?
GPU Kernel 是什么?为什么调它这么难?
A GPU kernel is a small program that runs in parallel across many threads on a GPU. A modern NVIDIA H100 has 132 streaming multiprocessors (SMs), each running thousands of threads concurrently — so a single kernel launch can have on the order of \(10^6\) thread instances in flight at once. The GPU schedules them across SMs automatically.
For matrix multiplication \(C = A \cdot B\) with \(C\) of size \(M \times N\), the simplest possible kernel — the naive kernel — assigns one thread per output element. Thread \((i, j)\) reads row \(i\) of \(A\), column \(j\) of \(B\), accumulates a \(K\)-long dot product, and writes \(C_{ij}\). Correctness is trivial; performance is terrible. Modern GPUs reach their headline TFLOPS only when the compute units are fed fast enough — and reading from HBM (high-bandwidth memory) is much slower per byte than computing on already-loaded data. The naive kernel re-reads each element of \(A\) and \(B\) many times, the SMs spend most of their cycles waiting on memory, and the achieved throughput is typically under 5% of peak.
The entire subfield of GPU kernel optimization is about closing this gap.
The Fundamental Tension: Compute vs Memory
基本张力:算力 vs 内存
Every kernel sits somewhere on a roofline:
- Memory-bound regime. Throughput is limited by how fast data arrives from HBM. To go faster, reduce bytes-moved per useful op.
- Compute-bound regime. Throughput is limited by how many FLOPs the SMs can do per cycle. To go faster, increase parallelism or use higher-throughput math units (tensor cores).
The dividing line is the kernel’s arithmetic intensity: FLOPs done per byte loaded. The crossover sits at \((\text{peak FLOPS}) \,/\, (\text{peak bandwidth})\). For an H100 with ~50 TFLOPS fp32 and ~3 TB/s memory bandwidth, that’s about 17 FLOPs/byte. Below 17, any further compute optimization is wasted; above 17, any further bandwidth optimization is wasted. Every well-tuned kernel pushes its arithmetic intensity firmly above the crossover. How depends on the operation. For matmul, the dominant tool is tiling.
Tile Size: The Key Knob
Tile Size:关键旋钮
Instead of one output element per thread, organize threads into blocks that cooperatively compute a \(T \times T\) tile of \(C\). The block first loads a strip of \(A\) (T rows, K columns) and a strip of \(B\) (K rows, T columns) into shared memory — on-chip SRAM, roughly 100× faster than HBM. The threads in the block then compute the entire \(T \times T\) tile from shared memory only, reusing each loaded element \(T\) times.
For each \(T \times T\) output tile:
- HBM bytes read: \(4 \cdot 2 T K = 8 T K\) (fp32)
- FLOPs done: \(2 T^2 K\) (one multiply + one add per \(T \times T \times K\) outer-product)
- Arithmetic intensity: \(T / 4\) FLOPs/byte
So doubling the tile size doubles the arithmetic intensity. Going from \(T = 1\) (the naive kernel) to \(T = 64\) takes you from 0.25 FLOPs/byte to 16 FLOPs/byte — right at the H100 crossover.
But \(T\) cannot grow without bound. Shared memory per block is only 48–228 KB, and each block needs \(4(T^2 + 2TK)\) bytes for the output tile and input strips. The register file and the total threads-per-block both impose further ceilings. If blocks become too big, you launch too few of them in parallel and lose occupancy — SMs sit idle for lack of work to schedule. So choosing \(T\) is a balancing act: large enough to be compute-bound, small enough to fit and to keep the SMs busy.
Try it yourself. The widget below models a simple square-tile matmul against representative GPU parameters (50 TFLOPS compute, 3 TB/s bandwidth, 48 KB shared memory per block). Slide \(T\) from 1 (the naive kernel) up through 128 (oversized) and watch the user point trace along the roofline.
The interesting band is roughly \(T \in [64, 96]\): large enough to be compute-bound, small enough to fit. Outside that band you’re leaving performance on the table. Real production kernels go much further — multi-level tiling (warp / thread / block), software pipelining, tensor-core packing, hand-tuned warp specialization, branch reduction, instruction-level scheduling — but the tile-size tradeoff is the first lever any optimizing kernel pulls.
Why TriMul Is Special (and What TTT-Discover Optimizes)
为什么 TriMul 特殊(以及 TTT-Discover 在优化什么)
The TriMul kernel is matmul with a twist: one input is restricted to upper- or lower-triangular, so half its entries are known to be zero. This pattern is ubiquitous in transformer training — the causal attention mask makes \(Q K^\top\) effectively triangular — so a fast TriMul translates directly into pretraining speedups for every causal-LM stack.
A naive kernel doesn’t know about the zeros. It loads them, multiplies by them, adds them, and wastes half the FLOPs. A specialized kernel must:
- Skip dead loads. Don’t fetch elements known to be zero — recover the lost bandwidth.
- Compress storage. Pack the triangular matrix into half the space — doubles effective bandwidth again.
- Branch-align tiles. Pick tile dimensions so each tile lies entirely above, below, or on the diagonal — never straddling, which forces warps to branch (all threads in a warp must take the same control path, so divergent branches serialize).
- Pipeline across the triangle. Reorder block iteration so warp issue rates stay high in the “thin” rows/columns near the diagonal.
These optimizations interact. The tile size that maximizes arithmetic intensity may put tile boundaries on the wrong side of the diagonal. The compressed layout that saves bandwidth may force ugly strided access patterns. Picking the best combination is what a competitive GPU engineer does over weeks of profiling.
This is the design space TTT-Discover walks. The system inherits an LLM’s prior knowledge of CUDA/Triton idioms, and uses RL to specialize a kernel for a specific TriMul shape on a specific chip. Each chip generation has different tradeoffs — H100 has tensor cores and 228 KB shared memory per block, A100 has 164 KB, AMD MI300x has wave-level instead of warp-level execution, B200 has yet another set of changes — so the optimal kernel for each chip is a different small program. The headline result you’ll see in the TTT-Discover section — beating GPUMode champions on four chip generations with one recipe — is impressive precisely because the same RL loop found a different optimum for each chip, with no human re-tuning.
TTT-Discover (2026): Training at Inference Time
TTT-Discover (2026):推理时训练
TTT-Discover keeps everything AlphaEvolve does outside the LLM — the evaluator, the prompt-building, the buffer of past programs — and changes one thing inside it: at every step, the LLM’s weights are updated by reinforcement learning, using the evaluator’s scores as reward. The rest of this section walks through what that one change buys, how it’s implemented, and how it performs on four very different problem domains.
Test-Time Training in One Paragraph
一段话讲完 Test-Time Training
Test-time training (TTT), introduced by Sun, Wang, Liu, Miller, Efros, Hardt at ICML 2020, takes a single test instance and runs a few gradient steps of a self-supervised loss on that instance before predicting. The thesis: each test instance defines its own learning problem. In 2020 the auxiliary loss was image-rotation prediction. In the 2024 sequence-modeling revival (TTT-Linear / TTT-MLP), it became an RNN inner state whose recurrence is itself a gradient step. TTT-Discover is the third act of the same idea: the “test instance” is now a scientific problem, the “auxiliary loss” is now an RL reward computed by the evaluator, and the “few gradient steps” become hundreds of RL updates on a single problem. The mechanism — train at test time, not just predict — is unchanged.
The single distinction from AlphaEvolve. AlphaEvolve: frozen LLM + evolutionary outer loop. TTT-Discover: the LLM’s weights are updated by RL using the evaluator’s scores as reward; the next batch of proposals is drawn from the updated model. Everything else is shared.
The Core Loop
核心循环
For each problem \(P\) paired with an evaluator \(E\), TTT-Discover instantiates a fresh copy of a pretrained policy \(\pi_{\theta_0}\) (the paper uses gpt-oss-120b, OpenAI’s open-source 120B-parameter model) and iterates:
init: policy π_θ = gpt-oss-120b
problem P with evaluator E
rollout buffer B = ∅
for step i = 0, 1, 2, ..., N:
# 1. PROPOSE: sample candidates from current policy
candidates = π_θ.sample(P, batch_size=k)
# 2. SCORE: deterministic evaluator
rewards = E(candidates)
B.extend(zip(candidates, rewards))
# 3. TRAIN: RL update on the rollouts
θ ← rl_update(θ, B) # ← the TTT step
# 4. KEEP BEST so far
best = max(best, max(rewards))
return best
Single-turn rollouts, multi-step training. Each call to
π_θ.sample(...)produces a single-turn generation — the model receives a prompt (problem spec plus a few high-scoring past candidates as in-context examples) and emits one complete artifact in one shot: a Python program (math, AtCoder, biology) or a CUDA/Triton kernel (GPUMode). It does not see compiler errors, benchmark timings, or RNA-seq scores within a single rollout. The reward is computed once on the finished artifact and becomes the credit signal for the policy-gradient update on that whole output sequence. The “iteration” happens at the RL-step level — the next batch is drawn from a different policy that has absorbed the gradient signal from this batch — so the model gets progressively better at writing correct kernels first try across training steps, not across turns within one rollout. This contrasts with agentic systems like SWE-agent or Sakana’s AI CUDA Engineer, which keep the LLM frozen and instead wrap it in a multi-turn scaffold (write → compile → read error → fix → re-run → …). TTT-Discover’s bet: in domains with a fast deterministic verifier, updating the weights buys more than giving the model more turns at the same compute.
The substantive line is step 3. AlphaEvolve has steps 1, 2, 4 — and replaces step 3 with “stuff top-scoring candidates into the next prompt as in-context examples.” TTT-Discover instead does an actual RL gradient update, so that the distribution from which step 1 samples shifts toward higher-reward regions of program space as the search proceeds.
The empirical signature is visible in the reward distribution per step. The figure below, from the paper’s project page, plots the reward distribution at steps 0, 9, 24, and 49 on the GPUMode TriMul kernel competition. The whole distribution drifts upward over the course of the run — not just the maximum, but the median, the tails, the modes. The model is becoming progressively more skilled at this one problem.
(For background on GPUMode and the TriMul kernel, see the GPU kernel warm-up above.)
Reward Design, Training Recipe, and Cost
奖励设计、训练配方、成本
The 2020 TTT paper used a self-supervised auxiliary loss because vision OOD has no labels at test time. TTT-Discover’s domain is different: every problem comes with an evaluator \(E: \mathrm{candidate} \to \mathbb{R}\) — did the program compile and run? did the proof check in Lean? does the kernel produce correct output, and how fast? Those scores are labels — imperfect (high score doesn’t imply optimum), but enough to drive RL.
Two design consequences follow.
The evaluator is the bottleneck. The method’s leverage comes entirely from how informative and inexpensive \(E\) is. Each step of TTT-Discover requires evaluating a batch of candidates; the inner loop is bounded by evaluator throughput. Math conjectures get this for free (closed-form scoring); GPU kernels need a correctness checker plus a timed benchmark; competitive programming needs hidden test cases; biology needs a held-out reference. The four case studies were chosen specifically because each has a fast, reliable, densely-informative scorer.
The base model has to be capable. A weak model produces samples whose average reward is too low for the gradient to find signal. gpt-oss-120b is large enough to occasionally hit high-reward candidates from cold-start prompting — RL then amplifies that occasional hit into a reliable mode of behavior. Trying TTT-Discover on top of a 7B model would likely fail not because the method is wrong but because RL needs something to reinforce.
Practical details that matter for reproducibility:
- Backbone: gpt-oss-120b, open-source. One fresh fine-tunable copy per problem.
- Optimizer & RL objective: standard policy-gradient setup with the rollout buffer as on-policy data; details follow recent open-source RL-for-LLM recipes.
- Steps per problem: on the order of 50 RL updates (the figure shows steps 0, 9, 24, 49); each step processes a batch of new rollouts.
- Sampling at each step: thousands of candidates per step in the engineering domains; fewer when each evaluation is expensive (biology).
- Wall-clock per problem: hours to a day on commodity GPU clusters.
- Dollar cost per problem: “a few hundred dollars” per the paper — an order of magnitude below the implied AlphaEvolve per-problem cost.
The cost asymmetry comes from two places. First, gpt-oss-120b on rented GPUs is much cheaper per token than frontier-API access to a model like Gemini 2.5 Pro. Second, RL is sample-efficient compared to evolutionary search — a gradient step uses every sample in the buffer; an evolutionary step uses only the survivors. The same compute spend extracts more signal.
Math Results: Erdős & Autocorrelation
数学结果:Erdős 与自相关
Who was Erdős, and what's the minimum overlap problem? (Click to expand)
Paul Erdős (1913–1996) was a Hungarian mathematician famous for his extraordinarily prolific output — over 1500 published papers with 500+ collaborators — and for the habit of posing open problems with small cash prizes attached to their solutions. The minimum overlap problem is one of his characteristic combinatorial questions, posed in 1955.
Informally: partition the integers from 1 to 2n into two equal-sized sets A and B. Now slide one of them — consider B shifted by every integer k. The two sets B+k and A will overlap in some number of points. The minimum overlap problem asks: how small can you make the largest such overlap, by choosing A and B cleverly? Formally, define M(n) = min over partitions of max over shifts k of |A ∩ (B + k)|. Erdős conjectured that the limiting density c = limn→∞ M(n)/n is strictly bounded away from 0 and from 1/4.
The problem is hard because it's extremal: even the most evenly-spread partitions have a "bad shift" that pushes the overlap up. No closed form for c is known; every published bound corresponds to an explicit numerical construction. Improving the bound is a long-running goalpost in additive combinatorics.
For a partition of \(\{1, \dots, 2n\}\) into two equal-sized sets \(A\) and \(B\), define \(M(n) = \min_A \max_k \vert A \cap (B + k) \vert\) — the smallest possible peak overlap as you slide \(B\) over \(A\). The Erdős minimum overlap conjecture asks for the value of \(c = \lim_n M(n)/n\). Improving either bound on \(c\) has been an open problem for decades; AlphaEvolve made the first numerical progress since 2016, and TTT-Discover advances it further.
| Method | Year | Best bound on \(c\) |
|---|---|---|
| Last published human result | 2016 | 0.380927 |
| AlphaEvolve | 2025 | 0.380924 |
| TTT-Discover | 2026 | 0.380876 |
The TTT-Discover number drops the bound by about \(5 \times 10^{-5}\) below AlphaEvolve — small in absolute terms but the kind of move that takes a competition-grade specialist a year of work. Both numbers correspond to explicit constructions (partitions of \(\{1, \dots, 2n\}\) found by the search), so the result is verifiable.
The autocorrelation inequality is a sibling problem: extremal autocorrelations of indicator functions of subsets of integers. AlphaEvolve had previously pushed it; TTT-Discover takes the same problem and improves further (concrete numbers are reported alongside the Erdős result in the paper’s tables). The methodological point of the math results is not that TTT-Discover solves either conjecture — neither is solved — but that on shared problems and shared compute budgets, the test-time-training lever beats the evolutionary-search lever. That comparison is cleaner than comparing a TTT system against a different method on different problems.
Engineering: GPU Kernels and AtCoder
工程:GPU Kernel 与 AtCoder
The most operationally impressive result is on the GPUMode TriMul kernel competition: write a CUDA/Triton kernel that computes a particular triangular matrix multiplication, matching reference correctness and beating the human leaderboard on wall-clock time. The reward signal is straightforward: correctness gate × time speedup. The same TTT-Discover run was repeated on four different accelerator generations:
| Hardware | Best human (μs) | TTT-Discover (μs) | Speedup |
|---|---|---|---|
| NVIDIA A100 | 4,531 | 2,198 | 2.06× |
| AMD MI300x | 2,462 | 1,596 | 1.54× |
| NVIDIA H100 | 1,371 | 1,161 | 1.18× |
| NVIDIA B200 | 1,005 | 905 | 1.11× |
Several things to read out of this table. First, the absolute speedup is larger on older / slower chips (A100, MI300x) and shrinks on newer ones (H100, B200) — the older chips leave more room for optimization that human contestants didn’t fully exhaust, while newer chips’ theoretical peak is closer to the human submissions’ wall-clock. Second, the cross-hardware consistency — every chip beaten, including AMD — suggests the method isn’t overfitting to one vendor’s quirks. Third, the human “best” entries are the result of organized public-competition work, not casual baselines, which makes the comparison meaningful.
What's AtCoder, and how does a Heuristic Contest differ from an algorithm contest? (Click to expand)
AtCoder is a Japanese competitive programming platform, comparable in scale and prestige to Codeforces. It runs two main contest formats with very different rules.
Algorithm contests are short sessions (2–5 hours) with problems graded pass / fail — your solution either matches the expected output on every hidden test case or it does not. They reward correctness and asymptotic efficiency.
Heuristic contests (AHC) are different. They run for one to two weeks, problems are NP-hard or otherwise admit no exact polynomial-time solution, and your score is a continuous value — e.g., total cost of a schedule, number of items packed, area covered. Top submissions are the product of dozens of incremental refinements over the contest window by an experienced competitive programmer. Beating the leader by even a small margin is therefore non-trivial: it requires inventing or implementing a heuristic that the field's best didn't try.
AtCoder Heuristic Contests (AHC) are long-form competitive programming problems where the score is continuous (not pass/fail) and the goal is to maximize it — closer in flavor to “construct a good solution” than to “find the algorithm.” TTT-Discover was run on past AHC contests with the contest’s own scoring used as reward.
| Contest | Domain | Best human score | TTT-Discover score |
|---|---|---|---|
| AHC39 | Geometry | 566,997 | 567,062 |
| AHC58 | Scheduling | 847,674,723 | 848,414,228 |
The margins are small but above the best human, which on contests of this scale means the top of several hundred competitive programmers, often with multiple submissions per contestant. Same method, same model, two very different domains.
Biology: Single-Cell RNA Denoising
生物学:单细胞 RNA 去噪
Single-cell RNA-seq, PBMC, Tabula, and the classical denoisers. (Click to expand)
Single-cell RNA sequencing (scRNA-seq) measures mRNA abundance for every gene in every individual cell. The output is a (cells × genes) count matrix where entry (i, j) is approximately the number of mRNA molecules from gene j detected in cell i. Typical experiments produce 10,000 to 100,000 cells by ~20,000 genes. The matrix is brutally sparse — each cell only has a few thousand non-zero entries — and is contaminated by technical artifacts (uneven capture efficiency, amplification bias, ambient mRNA from lysed cells).
Downstream analyses (cell-type clustering, trajectory inference, marker-gene discovery) are sensitive to that noise, so denoising is a critical preprocessing step. The widely-cited classical denoisers are MAGIC (Markov affinity-based graph imputation), DCA (deep count autoencoder), and scVI (single-cell variational inference); each trains a dedicated model on the dataset.
PBMC = peripheral blood mononuclear cells, a benchmark dataset of immune cells from blood. Tabula refers to the Tabula Sapiens / Tabula Muris cell atlases — large multi-organ scRNA-seq references widely used to evaluate denoising and integration methods.
The biology case study pushes the method farthest from its comfort zone. Single-cell RNA sequencing produces sparse, noisy gene-expression counts; classical denoising methods (MAGIC, DCA, scVI) require training a domain-specific model on the dataset. TTT-Discover instead asks gpt-oss-120b to write a denoising program — a Python script that takes the raw count matrix and emits a denoised one — and uses Pearson correlation against a held-out reference as the reward signal.
| Dataset | Best human method | TTT-Discover |
|---|---|---|
| PBMC | 0.64 | 0.71 |
| Tabula | 0.64 | 0.73 |
Two-to-nine points of Pearson correlation is a meaningful margin on these benchmarks. The interesting design choice is what TTT-Discover outputs: a program, not a denoised matrix. The model is trained to be good at writing denoisers for this dataset, not at denoising itself. That inheritance from FunSearch and AlphaEvolve — programs as the unit of discovery — keeps the artifact interpretable: a biologist can read the generated Python and slot it into a real pipeline. A model that outputted a matrix directly would be hard to trust the same way.
Where This Sits, and What Comes Next
这工作的位置与下一步
Place the three milestones on a 2×2:
| Single-domain (bespoke) | Open-domain (LLM-based) | |
|---|---|---|
| Frozen at inference | Pre-2020 supervised ML | FunSearch (2023), AlphaEvolve (2025), o1/R1-style scaling |
| Updated at inference | AlphaProof’s contest-time RL (2024, narrow) | TTT-Discover (2026) |
(For reference: AlphaProof is DeepMind’s Lean-based theorem-proving system that earned a silver-medal score at IMO 2024 by running its RL training loop during the contest itself — the one prior public example of inference-time weight updates for scientific discovery, but only in the narrow Lean-proof domain.)
TTT-Discover claims the previously empty bottom-right cell: a general-purpose discovery system whose LLM is trained on whatever single problem you point it at, using off-the-shelf open weights as the cold start. The math, engineering, and biology results are the proof that this quadrant is not just theoretically interesting but currently the strongest publicly-reported configuration on the problems where direct comparisons exist.
Four threads worth flagging, two practical and two conceptual.
Practical. Does cost stay flat as problems get harder? The reported few-hundred-dollar figure is presumably averaged over problems where RL did successfully climb the reward landscape. Frontier-difficulty problems may cost much more, or fail to converge at all. Can the trained per-problem weights be reused? The paper treats each problem in isolation by design, but the specialized weights are a real asset — a “fine-tuned for kernels” model could plausibly cold-start later kernel runs faster.
Conceptual. Where does the prior come from? TTT-Discover’s RL does not invent new mathematical machinery; it amplifies structural priors already inside gpt-oss-120b’s pretraining. As problems drift further from anything in the pretraining mix, the leverage RL provides over the base model presumably shrinks. Is “test-time training” still the right framing? When you train for hundreds of GPU-hours on a single problem, the line between “test-time” and “training-time” has blurred almost beyond recognition. The deeper claim is that the unit of training has shifted from a corpus to a problem — a thesis Yu Sun’s group has been building toward since 2020, and the strongest test of which is whether the next paper extends the trajectory.
Si et al. (2026): When the Task IS AI Research
Si 等人 (2026):当任务就是 AI 研究本身
While TTT-Discover puts the LLM on math, GPU kernels, AtCoder, and biology, a contemporaneous paper from Chenglei Si, Zitong Yang, Tatsu Hashimoto et al. at Stanford takes the same execution-grounded recipe and points it at a strikingly self-referential target: the task is to do AI research itself. “Towards Execution-Grounded Automated AI Research” (Si et al., Jan 2026) converts two real research workloads — LLM pre-training and post-training — into executable environments, then asks whether LLMs can discover better recipes through automated trial-and-error on real GPU runs. Its findings agree with TTT-Discover in some places and disagree in others, and the disagreements are the most informative part.
Turning AI Research Into an Executable Benchmark
把 AI 研究变成可执行 benchmark
What makes “AI research” an unusual search target is the absence of a clean external verifier. There is no correct pretraining recipe — only one that’s empirically better on held-out loss; there is no correct RLHF objective — only one whose downstream evaluations look better. Si et al.’s contribution is largely infrastructure: an automated executor that takes a proposed idea (natural-language description plus code skeleton), implements it in PyTorch, launches a real GPU training run, evaluates the resulting model on held-out tasks, and emits a single reward score back to the search loop.
Two environments anchor the paper:
- Pre-training environment. The reward is wall-clock time to reach a target validation loss on a nanoGPT-scale training run. Proposed ideas can touch the model architecture, the optimizer, the data ordering, the learning-rate schedule, anything inside the training script. Baseline: stock nanoGPT.
- Post-training environment. The reward is task accuracy after a fixed-budget RL fine-tune of a small base model. Proposed ideas can rewrite the entire post-training recipe — reward shaping, PPO hyperparameters, the rollout-collection scaffold, the data mix. Baseline: a standard post-training pipeline.
Both environments are deliberately real: candidates are not graded in a toy simulator but by actually launching GPU training runs of minutes to hours. That makes the evaluator expensive — orders of magnitude slower than TTT-Discover’s kernel benchmark or math scorer — but also makes the search target identical to what an ML researcher actually optimizes.
Two Learning Loops on the Same Executor
同一执行器、两种学习循环
Once the environment exists, the paper layers two learning algorithms on top of it — each playing the role TTT-Discover’s RL loop plays in the parent narrative:
| Evolutionary search | Reinforcement learning | |
|---|---|---|
| Inner loop | LLM proposes idea → executor runs training → score → keep top-k → seed next prompt | LLM proposes idea → executor runs training → score → policy-gradient update on the LLM |
| LLM weights | Frozen (like FunSearch / AlphaEvolve) | Updated (like TTT-Discover) |
| Idea diversity | Explicit (population maintained) | Implicit (entropy of the policy) |
This is precisely the design comparison the rest of the post has been pointing toward. Holding the executor fixed and varying only the learning loop above it isolates which lever — evolution or RL — is actually doing the work.
Results: Evolution Wins, RL Mode-Collapses
结果:演化胜,RL mode-collapse
The headline numbers:
| Task | Baseline | Best evolved | Best RL |
|---|---|---|---|
| Pre-training: reach nanoGPT target loss | 35.9 min | 19.7 min | (not best) |
| Post-training: task accuracy after fixed-budget RL | 48.0% | 69.4% | (mode-collapsed) |
Three findings carry over:
- Evolutionary search is sample-efficient on this domain. With a frontier LLM as the mutation operator, the loop finds genuine improvements on both tasks — a ~45% wall-clock reduction on the pre-training benchmark and a 21-point accuracy bump on the post-training benchmark. The FunSearch / AlphaEvolve template ports cleanly to the AI-research task.
- Frontier LLMs saturate early. After a modest number of generations, the proposed-idea quality plateaus: the model has emitted everything in its prior that’s plausibly relevant, and further evolution mostly recombines the same elements. This is the same frozen-weight ceiling AlphaEvolve hits on harder math problems — but here, on a domain (AI research) where the LLM has seen extensive pretraining data, the ceiling arrives surprisingly quickly.
- RL mode-collapses. The RL variant pushes the average reward up but fails to push the maximum. The policy converges on a narrow band of “safe” ideas that score reliably above baseline, but rarely emits the high-variance candidates whose tail occasionally lands on a much better recipe. The upper end of the distribution stays flat while the lower end rises — the opposite of TTT-Discover’s reward-distribution plot from the GPUMode TriMul run.
Why RL Wins for TTT-Discover and Collapses Here
为什么 RL 在 TTT-Discover 上赢、在这里塌缩
The contrast is the most useful thing to extract from Si et al. — and it sharpens the design space the parent narrative implies. Two factors plausibly explain the divergence.
1. Reward density and reliability. TTT-Discover’s evaluators emit a dense, deterministic, low-variance score per candidate: kernel time in microseconds, scalar value of the math objective, AtCoder simulator score, RNA-seq held-out loss. Each rollout produces a clean credit signal. Si et al.’s evaluator launches an entire training run — minutes to hours — whose final score depends on initialization, data shuffling, and dozens of nuisance variables. The same proposed idea can score 48% on Monday and 52% on Tuesday. RL with a noisy reward is well-known to collapse toward low-variance behaviors that game the noise rather than the underlying objective; evolutionary search, which keeps survivors only by rank, is more robust to that kind of noise.
2. Sample budget per evaluation. TTT-Discover can draw thousands of kernel candidates per RL step — each kernel evaluation is sub-second on a single GPU. Si et al.’s training-run evaluator caps the realistic per-step sample count at tens to low hundreds. Policy-gradient RL needs many on-policy samples to estimate the gradient; below some threshold, the gradient is dominated by noise and the policy collapses toward its prior mean. Evolutionary search has no such threshold — it can extract signal from a single survivor.
The honest reading. RL-at-inference works in domains with cheap, dense, deterministic verifiers; evolutionary search remains the safer default everywhere else. TTT-Discover deliberately picked four such domains; Si et al. show what happens when you take the same recipe out of that comfort zone. Neither paper invalidates the other — they jointly chart the operating envelope.
What Si et al. Adds to the Map
Si 等人在地图上加了什么
The paper’s most interesting contribution is not a new system but a new target: an executable benchmark for AI research itself. Most of the benchmarks surveyed in the next section measure agents on someone else’s research workflow (Kaggle, ICML papers, GitHub issues). Si et al.’s environments measure them on the workflow that produces the agents themselves. If a future system can close this self-reference loop — propose pretraining recipes that, when implemented, yield models that are better at proposing pretraining recipes — that’s a qualitatively different milestone from anything else in this post. Si et al.’s contribution is to make that loop concretely measurable for the first time.
BixBench (2025): When the Verifier IS an LLM
BixBench (2025):当验证器本身就是 LLM
Every system in this post so far has been graded by a verifier that is a piece of code: assert is_cap_set(S) for FunSearch, a CUDA timing loop for AlphaEvolve and TTT-Discover, a training-loss target for Si et al. The Si et al. section ended on an observation — RL-at-inference works when the verifier is cheap, dense, and deterministic, and collapses when it is noisy. That framing implicitly assumes a verifier exists. What about the science problems where it doesn’t?
BixBench (Mitchener et al., FutureHouse / ScienceMachine, arXiv 2503.00096, v3 Oct 2025) is the cleanest published attempt to grade open-ended bioinformatics analysis without writing a verifier per task. The agent is given a Jupyter notebook, a folder of raw data, and a research question; it must explore, write Python/R/bash, and produce a defensible answer. There is no assert that can check the answer. So the paper makes a different bet: the verifier is itself an LLM. Unpacking how the bet is made tractable — and what the results say about current frontier models — is the right closing move before the wider benchmark map.
Capsule Curation: Real Bioinformaticians Build the Scenarios
Capsule 整理:真实生物信息学家造场景
A BixBench capsule is the unit of evaluation:
hypothesis + input data + reference analysis code + result + true/false answer
The data are not synthetic. FutureHouse recruited PhD-level bioinformaticians through professional networks and paper-author outreach, gave them a Google Colab template, and asked each to either recapitulate a published analysis or run a de novo analysis on data of their choice. Analysts write Python, R, or bash inside the notebook; public datasets are pulled with wget, private ones uploaded as files. The author team and a second reviewer sign off before a capsule enters the corpus.
The final corpus has 61 capsules spread across the natural sub-fields of computational biology — Figure 2 of the paper plots the distribution: Genomics is the largest bin, followed by Transcriptomics, Differential Expression, RNA-seq, Phylogenetics, Whole-Genome Sequencing, Variant Analysis, Imaging, Epigenetics, Single-Cell Analysis, and roughly ten more categories.
From each capsule, Claude 3.5 Sonnet drafts MCQ questions in two rounds of four (eight drafts per capsule); analysts approve, reject, or edit; a final dedup pass (Claude again, run in triplicate, ~95% concordance) trims to 205 final open-answer questions — about 3.8 per capsule. The open-answer and the MCQ versions reference the same ground truth.
The Agent: A Notebook, Three Tools, a Docker Container
Agent:一个 notebook、三个工具、一个 Docker 容器
The agent is dropped into a pre-built Docker container (BixBench-env:v1.0) loaded with Python, R, bash, and the standard bioinformatics stack — pandas, scanpy, bioconductor packages, anything an analyst would have installed. The action space is intentionally minimal:
-
edit_cell(idx, code)— write or modify a Jupyter cell; executing it re-runs the entire notebook, exposing tracebacks, tables, plots -
list_workdir()— recursive directory inspection -
submit_answer(text)— end the trajectory
Trajectory length is unbounded. The 4.2-hour average wall-clock per task (Table 1 in the paper) puts BixBench in the same shape-of-cost ballpark as RE-Bench (8 h) — orders of magnitude longer than KernelBench-style millisecond evaluations, but still bounded enough to evaluate 200+ questions in a single research compute budget.
The framework — Aviary, FutureHouse’s open-source agent gym — is the same scaffold used by the group’s earlier scientific-agent work. BixBench’s contribution at this layer is not the scaffold but the environment: capsules, packages, ground truth, and the eval harness, all reproducible from the repo at github.com/Future-House/BixBench.
The Evaluator: An LLM Judge
Evaluator:LLM 作 judge
This is the design choice that defines BixBench, and it’s worth stating directly.
For open-answer evaluation, after submit_answer, a separate Claude 3.5 Sonnet instance receives (agent’s answer, ground-truth result) and emits a binary 0/1 score — the LLM is the judge. Five parallel trajectories per question are averaged to control trajectory variance.
For MCQ evaluation, after the analysis trajectory the full notebook plus the original question is handed to a second Claude with multiple-choice options now revealed. The agent’s open-answer is also included. The second LLM picks an option — including, in some experimental conditions, an “insufficient information” refusal option.
The shape is structurally the same as PaperBench’s LLM-as-judge pipeline, but the choice has consequences:
- It lets BixBench handle arbitrary research questions without writing a verifier per task — the only ground truth needed is “the analyst’s stated result,” a single sentence.
- It introduces an evaluator that can itself be wrong. The paper does not run an inter-judge agreement experiment or human–judge calibration. The reader has to read the 21% number as “what a Claude judge thinks Claude / GPT-4o got right,” not as a ground truth.
The MCQ regime exists as a fallback: when the open-answer judge is too noisy, MCQ constrains output to a finite set the judge can match exactly. The cost is that “did the agent reach the right conclusion?” becomes “did the agent pick the right option from a list the LLM drafted from the same notebook the agent saw?” — a subtly different and somewhat easier question.
Headline Results, and the Refusal Ablation
标志数字与 refusal ablation
The headline number is sobering:
| Regime | GPT-4o | Claude 3.5 Sonnet | Random / recall baseline |
|---|---|---|---|
| Open-answer | ~15% | ~21% | ~3% (recall, no notebook) |
| MCQ w/ refusal | ~20% | ~25% | ~20% (random, 4 opts + refuse) |
| MCQ w/o refusal | ~33% | ~40% | ~25% (random, 4 opts) |
Three observations from the paper.
- Open-answer is the honest evaluation. Claude 3.5 Sonnet caps at 21%; GPT-4o at 15%. The recall baseline — same question, same model, no notebook access — sits at 3%. So the agent is doing real work in the notebook; it’s just bad at it.
- The refusal option exposes a calibration failure. In MCQ with refusal, both models hover at random. They opt out — even when the correct option is visible — whenever they don’t trust their reasoning. Remove the refusal option and forced-pick accuracy doubles. That gap is a behavioral signature, not a competence one: the model knows it’s uncertain and prefers not to commit.
- Vision and majority voting don’t help. Letting the agent generate plots vs. forcing text-only outputs has no significant effect (Figure 5, bottom). Majority voting across \(k = 5\) trajectories flattens around the single-run mean (Figure 5, top). Neither knob recovers signal.
One more disclosure worth flagging: the paper excluded reasoning models (o1, DeepSeek R1) because they struggled with long-context tool use during preliminary tests. So the 21% ceiling reflects the non-reasoning GPT-4o / Sonnet 3.5 era; later reasoning-trained models (o3, Sonnet 3.7+, Gemini 2.5 Thinking) have not been evaluated in the v3 release. The authors’ own future-work note explicitly identifies BixBench’s binary capsule reward as well-suited to the same RL-from-binary-reward recipe that TTT-Discover demonstrates.
Why BixBench Matters for This Post
BixBench 对本文的意义
Two ways to read BixBench against everything else here.
First, as a stress test on the LLM-as-judge approach. Almost every “AI scientist” benchmark in the broader map below (PaperBench, AI-Scientist v2, MLRC-Bench’s harder tiers, MLGym’s open tasks) has the same shape: the agent does something open-ended; an LLM grades the output. BixBench is the most carefully-curated specimen of this family, and its 21% ceiling is informative. Whether 21% means “models are bad” or “the judge is bad” is not something the paper resolves — and that uncertainty itself is the cost of doing without a verifier.
Second, as the right counterpoint to the TTT-Discover / Si et al. axis. TTT-Discover’s four chosen domains all have crisp deterministic verifiers; that is precisely what lets RL-at-inference work there. Si et al. showed RL collapses when the verifier is noisy. BixBench shows what the third regime — no verifier at all, only an LLM judge — actually looks like in practice. The honest summary is that benchmarks in this regime cannot yet distinguish 21%-correct from 50%-correct on the underlying science, only on what one fixed judge believes about it. Any successor system that posts a high BixBench score should be cross-checked by a human bioinformatician before the number is trusted — a discipline none of the LLM-as-judge benchmarks has yet established at the field level.
The next section drops the case-study lens and surveys the broader landscape — but with this caveat in mind: scores from LLM-judged benchmarks are not directly comparable to scores from verifier-judged benchmarks, even when both are reported as a percentage.
FrontierCS & FrontierSmith (2025–26): The Open-Ended Frontier, and a Data Engine for It
FrontierCS 与 FrontierSmith (2025–26):开放式前沿,以及喂养它的数据引擎
Two coupled papers from an overlapping author group (Mang, Chai, He, Mao, Zhou …, with Gonzalez, Dimakis, Cheung, and Shang as senior authors) sharpen the thesis this post keeps circling — verify is easy, construct is hard — into a clean benchmark and, separately, into a way to manufacture training data for it. They pair naturally: FrontierCS (Dec 2025) is the measuring stick; FrontierSmith (May 2026) is the factory.
FrontierCS: 156 Problems Where the Optimum Is Unknown
FrontierCS:156 个最优解未知的问题
FrontierCS is a benchmark of 156 open-ended CS problems, curated and reviewed by experts (CS PhDs, top competitive-programming setters). Its defining move is the one FunSearch and AlphaEvolve made implicit: every task has no known optimal solution, but a solution whose quality is objectively scorable. Models don’t emit an answer — they submit an executable program, graded by an automatic evaluator with partial scoring against an expert reference. Two tracks share the same construct-hard / verify-easy property: algorithmic (mostly NP-hard variants of competitive-programming problems) and research problems.
A concrete case makes the property tangible. Take an algorithmic-track task in the mold of online bin packing — the problem this post already met in the FunSearch section: items of varying size stream in one at a time and must each be dropped immediately into a unit-capacity bin; minimize the number of bins. You submit a program that picks a bin for each item. The asymmetry is the whole point: proving the true minimum for a large instance is NP-hard — nobody knows the optimum — yet scoring any program is trivial: run it on a battery of hidden instances and count the bins it actually uses. A submission averaging 412 bins unambiguously beats one averaging 419, even if neither is optimal. That number is the partial score; the expert reference solution only fixes the scale. The research track has the same shape — a sharper approximation algorithm or system design whose quality is read off a measured objective, not a pass/fail test.
The headline is a frontier wall, stated plainly: frontier reasoning models still lag far behind human experts on both tracks, and increasing the reasoning budget alone does not close the gap. Note that “optimize” here has nothing to do with training — no weights are updated. It is the agent’s inference-time loop: the model submits an executable program, the evaluator returns a score, and over many turns the model uses that environment feedback to iteratively push its program closer to a solution. The sharpest qualitative finding — the one that rhymes with everything else here — is that within that loop models over-optimize for generating merely workable code instead of discovering high-quality algorithms and system designs: round after round they spend their turns getting something that runs and passes, then stop, rather than continuing to drive the partial score upward toward a better design. A model trained on closed-ended unit tests reaches for the first solution that runs; an open-ended scorer rewards the one that runs best, and that gap is exactly where today’s models fall down.
This is the CS-wide generalization of TTT-Discover’s hand-picked four. Where TTT-Discover proved a method on four deterministic-verifier domains, FrontierCS builds the yardstick — 156 of them — and finds that frontier models, run as plain solvers, are nowhere near the human frontier. It is the same shape open-ended scientific-discovery benchmarks keep producing: a verifier that admits partial credit, frontier models that stall well short of it, and more compute that does not rescue them.
FrontierSmith: Synthesizing Open-Ended Problems at Scale
FrontierSmith:规模化合成开放式问题
If FrontierCS shows models are weak at open-ended construction, the obvious next question is training data — and open-ended problems are scarce and expensive to author by hand. FrontierSmith is an automated system that evolves open-ended problems out of closed-ended seeds. Starting from ordinary competitive-programming problems, it generates candidate variants by changing the goal, restricting the output, and generalizing the input, then keeps only the ones that matter via a quantitative “idea-divergence” metric — a candidate survives only if different solvers attack it in genuinely different ways. Agents then write test cases and verifiers for the survivors, closing the loop into trainable tasks.
The payoff is measurable. Training on the synthesized data lifts base models substantially: Qwen3.5-9B gains +8.82 on FrontierCS and +306 Elo on ALE-bench; Qwen3.5-27B gains +12.12 and +309. And a telling side effect — the synthesized problems make agents take more turns and spend more tokens, just like human-curated ones — is evidence that closed-ended seeds, evolved correctly, can stand in for expensive human authoring of long-horizon problems.
Two details connect straight to the rest of this series. First, the idea-divergence selection is the same instinct as SMDD-Bench’s diversity leaderboard — does the agent find genuinely distinct solutions, or collapse onto one? Diversity is becoming a first-class signal, in open-ended evaluation and now in data generation, not an afterthought. Second, “more turns, more tokens” is precisely the cost structure of open-ended agentic work: long trajectories are where both the value and the expense of open-ended discovery live.
What the Pair Adds to the Map
这一对给地图补上了什么
Read together, the two papers close a loop the earlier milestones left open. FunSearch and AlphaEvolve supplied the search loop; TTT-Discover added training at inference time; Si et al. and BixBench stress-tested the verifier. FrontierCS and FrontierSmith supply the missing benchmark-and-data axis: a hard, partial-credit yardstick that today’s frontier models fail, and a scalable engine to manufacture the open-ended training problems needed to climb it. The flywheel is now visible end to end — generate problems (FrontierSmith) → train (the TTT-Discover-style loop) → measure against an open-ended frontier (FrontierCS) → repeat. The gap that remains is the one this post keeps returning to: the science domains — materials, chemistry, drug design — where the verifier is itself expensive or wet-lab-bound, and the frontier wall is taller still.
Beyond the Four Domains: The Wider Benchmark Landscape
超越四个领域:更广的 Benchmark 全景
TTT-Discover targets four problems by design: an Erdős extremal-combinatorics conjecture, the GPUMode TriMul kernel competition, AtCoder Heuristic Contests, and single-cell RNA denoising on OpenProblems. Each was chosen for a fast deterministic verifier and a meaningful human baseline. Over the last 18 months, the LLM-agent community has built out a much larger benchmark ecosystem with the same shape — and any successor work in this line will be measured against the ecosystem rather than the hand-picked four. This section sketches the map.
A Four-Criteria Filter for "AI Scientist" Benchmarks
四标准过滤器:什么算"AI Scientist" Benchmark
Across the recent literature, benchmarks credibly framed as “AI scientist” or “research agent” evaluations tend to satisfy four properties:
- Open-ended. The agent picks among many viable solution paths; there is no single canonical answer to copy from training data.
- Ideation matters. Performance is bounded by what non-obvious approach the agent tries, not just execution speed. Hyperparameter tuning alone shouldn’t win.
- Multi-step environment. The agent interacts with a runnable artifact — file system, compiler, simulator, dataset. Reward comes from program execution, not string matching.
- Non-trivial rollout horizon. Each evaluation takes minutes to hours, which is where iterative search and learned approximate scorers start to pay off.
Several widely-cited benchmarks fail at least one criterion and are therefore measuring something different. SWE-bench Verified problems mostly have a single hidden correct patch (criterion 1 fails for most instances). FrontierMath and PutnamBench score in milliseconds against a single numerical or symbolic answer (criterion 4 fails). BrowseComp asks for a single short string (criterion 3 fails). These are not bad benchmarks — they just measure other things (closed-ended program repair, math-answer recall, web retrieval).
The Stratified Stack
分层栈
The benchmarks that do pass the filter organize, somewhat naturally, into tiers of increasing research demand:
| Tier | What the agent does | Representative benchmark | Year |
|---|---|---|---|
| ML engineering | Win a Kaggle competition | MLE-bench (75 Kaggle competitions, OpenAI) | 2024 |
| Research engineering | Beat human ML-RE workflows | RE-Bench (7 tasks, 71 human-expert attempts, METR) | 2024 |
| Methodological innovation | Beat baselines via new methods | MLRC-Bench (7 recent ML-conference contests) | 2025 |
| Long-horizon optimization | Hill-climb a scoring function for days | ALE-Bench (AtCoder Heuristic Contests, Sakana) | 2025 |
| GPU kernel optimization | Write a CUDA / Triton kernel that beats reference | KernelBench (Stanford, 250 PyTorch tasks) | 2024 |
| Full paper reproduction | Build a paper end-to-end | PaperBench (20 ICML 2024 papers, OpenAI) | 2025 |
| End-to-end paper generation | Pick topic, execute, write | AI-Scientist v2 (Sakana, open-ended) | 2025 |
Each tier raises the bar from “given a goal, can you reach it?” toward “can you set your own goal and reach it?” Current frontier LLMs perform passably at the engineering end and progressively worse moving up — o1-preview + AIDE reaches Kaggle bronze in 16.9% of MLE-bench competitions; the best agent on MLRC-Bench closes only 9.3% of the gap between baseline and top human. Performance has not yet saturated at any tier.
A second family sits one level deeper, more useful for training agents than for headline leaderboards: MLAgentBench (Stanford / Snap, 13 ML tasks), MLE-Dojo (a gym-style trainable extension of the MLE-bench distribution with explicit SFT/RL support), MLGym (Meta + NYU, 13 open-ended tasks across CV / NLP / RL / game theory), and DSBench (466 data analysis + 74 data modeling tasks from ModelOff + Kaggle).
The figure below plots ~40 of these benchmarks on a 2D map — domain on the x-axis, research intensity on the y-axis. Each marker is a benchmark; color encodes year. Hover for a one-line description; click to open the paper or repo.
ML & Research Engineering — Deep Dive
ML 与研究工程——细读
The most populated tier — every team has reproducible ML workflows to evaluate, so this is where benchmarking converged first.
-
MLE-bench (OpenAI, arXiv 2410.07095, Oct 2024) — 75 Kaggle competitions across vision, NLP, tabular, signal, biology. Agents train models, prepare datasets, submit predictions; graded against Kaggle’s actual leaderboard medal thresholds. Best published result: o1-preview + AIDE scaffolding reaches Kaggle bronze in 16.9% of competitions. Rollouts span hours to dozens of hours per competition. Note: OpenAI paused new leaderboard submissions on 24-Apr-2026 pending a v2 fairness overhaul; published o1-preview / AIDE numbers and the AIRA Toledo et al. follow-ups remain the comparable anchors. Code: github.com/openai/mle-bench.
-
RE-Bench (METR, arXiv 2411.15114, Nov 2024, ICML 2025) — 7 ML research-engineering environments: scaling-law fitting, GPU kernel optimization, LLM training task search, restricted MLP architecture search, etc. Designed with feedback from frontier-lab research engineers. 71 8-hour attempts by 61 distinct human experts open-sourced; agent transcripts at transcripts.metr.org. One AI agent wrote a faster custom Triton kernel than any of the human experts. 2h / 8h / 32h time-budget protocol. Among the richest trajectory datasets available. Authors request submitters keep solutions out of training data. Code: github.com/METR/RE-Bench.
-
PaperBench (OpenAI, ICML 2025) — reproduce 20 ICML 2024 Spotlight/Oral papers from scratch: codebase + experiments + result matching. 8,316 individually graded sub-tasks via author-co-developed rubrics + LLM judge (SimpleJudge). ML PhDs achieve 41.4%; Claude 3.5 Sonnet New reaches 21.0%. Full PaperBench costs roughly $4,000 USD per model on average; PaperBench Code-Dev variant is cheaper and GPU-free (~$2,000), though the paper notes Code-Dev correlates only “weakly” with full PaperBench. Code: github.com/openai/preparedness.
-
MLAgentBench (Stanford / Snap, arXiv 2310.03302, NeurIPS 2024) — 13 ML experimentation tasks: CIFAR-10 improvement, BabyLM, Kaggle challenges. ReAct-style scaffold with file ops, code execution, output inspection. Best agent (Claude 3 Opus) hits 37.5% average success. Foundational predecessor to MLE-bench / MLRC-Bench. Some tasks (CIFAR-10) now saturated. Code: github.com/snap-stanford/MLAgentBench.
-
MLRC-Bench (arXiv 2504.09702, Apr 2025, NeurIPS 2025) — 7 ML research competition tasks from recent ML conferences (LLM merging, machine unlearning, perception, few-shot). Explicitly designed to require novel methodological innovation beyond hyperparameter tuning. Best agent (gemini-exp-1206 under MLAB scaffold) closes only 9.3% of the gap between baseline and top human — currently the strongest measure of how far frontier LLMs are from doing methodological research. Designed as a “dynamic” benchmark that incorporates new competitions. Code: github.com/yunx-z/MLRC-Bench; HF Spaces leaderboard.
-
MLE-Dojo (arXiv 2505.07782, May 2025) — Gym-style interactive environment built on 200+ Kaggle challenges with a dedicated training set of 150 tasks. Explicitly supports SFT and RL — the trainable counterpart to MLE-bench. Sampling trajectories provided for both. Heavy overlap with MLE-bench’s Kaggle distribution, so most teams pick one or the other depending on whether they need trainable RL trajectories (MLE-Dojo) or industry-standard leaderboard comparability (MLE-bench). Code: github.com/MLE-Dojo/MLE-Dojo.
-
MLGym / MLGym-Bench (Meta + NYU, arXiv 2502.14499, Feb 2025) — Gym-style ML research environment with 13 open-ended tasks spanning CV, NLP, RL, game theory. Baselines from Claude 3.5 Sonnet, Llama-3.1-405B, GPT-4o, o1-preview, Gemini-1.5 Pro. CC-BY-NC 4.0 license — limits commercial deployment. Authors note current frontier models mostly tune hyperparameters in practice rather than ideating. Code: github.com/facebookresearch/MLGym.
-
AI-Scientist v1 & v2 (Sakana, v1 arXiv 2408.06292, 2024; v2 arXiv 2504.08066, 2025). v1 generates end-to-end ML papers on nanoGPT, 2D diffusion, grokking templates. v2 removes templates and uses progressive agentic tree search; one v2-generated paper was accepted at the ICBINB workshop at ICLR 2025 (then voluntarily withdrawn per a pre-agreement). Per ICBINB organizers’ ICLR 2026 proposal, past editions accepted roughly 35 of 40 submissions (~87.5%), far above main-conference rates — treat the “first AI-generated peer-reviewed paper” headline accordingly. ~$15–25 per full paper. Caveat: this is an agent system, not a benchmark with standardized scoring; evaluation requires LLM-as-judge or human review. Code: github.com/SakanaAI/AI-Scientist (~13.8k stars) and github.com/SakanaAI/AI-Scientist-v2 (~6.4k stars).
-
DSBench (ICLR 2025, arXiv 2409.07703) — 466 data-analysis tasks + 74 data-modeling tasks from ModelOff and Kaggle. Long contexts, multimodal data, multi-table reasoning. Best agent solves 34.12% of analysis tasks. More “analyst” than “scientist,” but a useful adjacent reference. Code: github.com/LiqiangJing/DSBench.
Scientific Discovery & Reproduction — Deep Dive
科学发现与复现——细读
The “scientist-like” tier — domain-grounded tasks closer to actual research workflows than to ML engineering.
-
TTT-Discover’s open-sourced environments (arXiv 2601.16175, Jan 2026) — the four domains the rest of this post analyzed are themselves fully open-sourced: Erdős minimum-overlap & autocorrelation inequalities (closed-form continuous reward; AlphaEvolve set 0.380924, TTT-Discover improved to 0.380876); GPUMode TriMul kernel competition (H100/A100 timing as reward; TTT-Discover 1161μs vs human 1371μs on H100); AtCoder Heuristic Contests (overlaps ALE-Bench; 1st place in ahc039); single-cell denoising on OpenProblems (training on Pancreas, evaluation held out on PBMC and Tabula Muris Senis Lung; Poisson / MSE score). Code: github.com/test-time-training/discover with sandbox reward evaluators and ALE-Bench / GPUMode integration.
-
DiscoveryBench (AI2, arXiv 2407.01725, Jul 2024) — 264 real data-driven discovery tasks from 6 domains (sociology, biology, humanities, economics, engineering, meta-science) + 903 synthetic tasks. Each task = dataset + metadata + natural-language discovery goal. Hypotheses are constrained by available columns — “open-ended” but bounded by the schema. 100+ citations by mid-2025; foundational AI2 reference. Code: github.com/allenai/discoverybench; HF dataset.
-
ScienceAgentBench (OSU, arXiv 2410.05080, ICLR 2025) — 102 tasks from 44 peer-reviewed publications in bioinformatics, computational chemistry, GIS, cognitive neuroscience. Agent must produce a self-contained Python program. Best agentic framework solves ~32.4% (rising to 34.3% with expert knowledge). Code: github.com/OSU-NLP-Group/ScienceAgentBench; HAL leaderboard at hal.cs.princeton.edu.
-
BixBench (FutureHouse, arXiv 2503.00096, v3 Oct 2025) — 61 real-world bioinformatics capsules with 205 open-answer questions; agents run Python/R/Bash in a Jupyter + Docker environment. LLM-as-judge evaluator; Claude 3.5 Sonnet tops at ~21% open-answer, GPT-4o at ~15%. Adopted by Biomni Lab, Edison Analysis, and multiple agent providers. Full case study in the dedicated section above. Code: github.com/Future-House/BixBench; HF dataset.
-
LAB-Bench (FutureHouse, 2024) — 2,400+ multiple-choice questions on practical biology research capabilities. MCQ format limits its use as an open-ended research-agent eval, but a reference for “does the model know the standard bio workflows.”
-
CORE-Bench (Princeton, arXiv 2409.11363, TMLR 2025) — 270 tasks from 90 papers across CS / social science / medicine focused on computational reproducibility (rerun existing code/data). 3 difficulty levels; best agent: 21% on hardest level. Narrower than PaperBench but more grounded.
-
SUPER (Allen AI, arXiv 2409.07440, EMNLP 2024) — 45 expert end-to-end + 152 sub-problems + 602 auto-generated problems on setting up and executing research repos. GPT-4o solves 16.3% e2e / 46.1% sub-problems. Gold solutions and baseline trajectories at github.com/allenai/super-benchmark.
-
RExBench (Tin Lab, arXiv 2506.22598, Jun 2025) — 12 realistic research-extension tasks (NLP/ML), each built on existing papers with domain-expert-written extension instructions. Best agent: ~33% without hints, ~44% with hints. Extensions are given, not discovered. Code: github.com/tinlaboratory/RExBench (gated HF dataset).
Code, Kernels, and Software Engineering — Deep Dive
代码、Kernel 与软件工程——细读
-
KernelBench (Stanford, arXiv 2502.10517, Dec 2024, ICML 2025) — 250 PyTorch ML workload tasks across 4 difficulty levels (single ops, fused ops, full models, HuggingFace model optimization). Metric:
fast_p= correct AND >p× faster than PyTorch. Native single-turn but explicitly supports iterative refinement with execution + profiler feedback; DeepSeek R1 with 10 turns reaches 43% / 72% / 18% across Levels 1/2/3. New kernelbench.com v3 (multi-GPU) and “Hard” (Blackwell) variants released Q1/Q2 2026; KernelBench-X extension exists; follow-ups include Meta’s KernelEvolve, AMD’s GEAK Triton agents, Sakana’s reproductions. Best-paper at DL4C @ ICLR 2025; ICML 2025. The field’s central GPU-kernel-LLM benchmark. Code: github.com/ScalingIntelligence/KernelBench. -
ALE-Bench (Sakana, arXiv 2506.09050, Jun 2025) — past AtCoder Heuristic Contest tasks: score-based NP-hard optimization (routing, scheduling, packing). The only major benchmark explicitly designed for objective-driven, score-based, long-horizon algorithmic engineering (contests run for one to two weeks). Supports interactive agents with test-run feedback and visualizations. ALE-Agent ships as reference baseline. Backbone for TTT-Discover’s algorithm domain. Code: github.com/SakanaAI/ALE-Bench; HF dataset at hf.co/datasets/SakanaAI/ALE-Bench.
-
SWE-bench Verified / Pro / EVO — the repository-bug-fix family. SWE-bench Verified is 500 human-filtered SWE-bench instances; mostly closed-ended (hidden unit tests) with 1–2 line fixes for 161 of 500 problems. SWE-Bench Pro (Sep 2025) is 1,865 problems from 41 enterprise-style repos with long-horizon, multi-file changes (avg 21 files modified). SWE-EVO (Dec 2025) provides 48 release-note-based tasks with avg 874 tests per instance. All three measure software engineering capability but a single canonical patch is hidden behind tests — doesn’t satisfy “open-ended ideation.” Useful adjacent baselines.
-
SWE-Gym (Berkeley, Dec 2024, ICML 2025) — 2,438 real-world Python tasks with executable runtime, unit tests, NL specs. Designed for training, not just evaluating. The entire fine-tuned model + trajectory dataset are open. 32.0% on SWE-Bench Verified with a 32B model. R2E-Gym extension (arXiv 2504.07164) provides 8.1K procedurally-generated environments for even more trajectory data — strong for trajectory pretraining, weak on ideation.
-
GSO (arXiv 2505.23671, May 2025, NeurIPS 2025 D&B) — 102 software optimization tasks mined from real commits in NumPy, Pandas, Pillow, llama.cpp, etc. across 5 languages. Agent gets codebase + perf test, must beat expert dev optimization. “Hack Detector” anti-gaming. Frontier SWE-agents get <5% OPT@1. Agent trajectories explicitly released at gso-bench.github.io. Closest cousin of KernelBench in spirit.
-
Commit0 (Cornell + Cohere, arXiv 2412.01769, Dec 2024) — 54 entire Python libraries built from scratch from API specs + interactive unit tests. Best agent: 29.30% subset / 6.12% full.
-
Aider Polyglot (Dec 2024) — 225 hard Exercism problems in 6 languages. Light multi-step: two attempts with unit-test feedback. Closed-form puzzles; useful for “does the model code?” but not for ideation-bound evaluation.
-
SWE-Lancer (OpenAI, 2025) — freelance software-engineering tasks scored by code review / payout. Less mature trajectory ecosystem than the SWE-bench family.
Math, Theorem Proving, and Formal Reasoning — Deep Dive
数学、定理证明、形式推理——细读
-
FrontierMath (Epoch AI, 2024) — per Epoch AI’s benchmark page: “The full FrontierMath dataset contains 350 problems. This is split into a base set of 300 problems, which we call Tiers 1–3, and an expansion set of 50 exceptionally difficult problems, which we call Tier 4.” Single-answer (numerical / symbolic match), sub-second eval. Excellent capability signal but doesn’t reward search.
-
PutnamBench (arXiv 2407.11214) — 640 Putnam problems formalized in Lean 4, Isabelle, and Coq (417 in Coq). A proof attempt is multi-step in the prover but eval is binary pass/fail with fast verification.
-
MiniF2F — 488 olympiad-level problems formalized across Lean / Isabelle / HOL Light / Metamath. The longest-standing formal math benchmark.
-
LeanDojo / Lean Workbook — end-to-end Lean 4 environment with retrieval; Lean Workbook has ~57K formal-informal problem pairs. Better positioned than PutnamBench for open-ended search because the environment is multi-step at tactic level, though eval is still cheap.
-
Erdős-style open math problems (via TTT-Discover) — the autocorrelation inequality and minimum-overlap problems used by TTT-Discover are the only “open-ended math” tasks that genuinely satisfy continuous reward + ideation matters + multi-step search + recognized math identity. As a benchmark suite they exist only as TTT-Discover’s sandbox.
Adjacent Ecosystems: Paper Writing, Web Agents, Games, Evolutionary Reproductions
相邻生态:论文写作、Web Agent、游戏、演化搜索复现
Paper writing. ResearchBench (Shanghai AI Lab, Mar 2025) covers inspiration retrieval + hypothesis composition + hypothesis ranking over 1,386 papers in 12 disciplines — but it’s a static text benchmark, no multi-step environment. ReportBench (ByteDance) is survey-writing eval via citation overlap + statement validation — same disqualifier. The community-run ML Reproducibility Challenge is annual but with informal eval. AI-Scientist v2 generates papers but evaluation falls back to LLM-as-judge or workshop submission.
Web agents. WebArena, VisualWebArena, WorkArena, Mind2Web provide multi-step browser environments with binary task-completion rewards — useful adjacent baselines if web research is in scope. BrowseComp (OpenAI, Apr 2025) is 1,266 hard short-answer web research questions — single-answer format, sparse reward; OpenAI Deep Research already saturates the easy portion. BrowseComp-Plus (ACL 2026) adds a fixed corpus with human-verified supporting docs; better-controlled but still single-answer.
Game-based RL environments. NetHack via NLE, Crafter, MineDojo, Voyager; BALROG (ICLR 2025) aggregates several. Craftax (JAX, multi-orders-of-magnitude speedup) supports fast iteration. These satisfy “open-ended” and “long-horizon” but the broader ML community frames them as embodied / long-horizon planning testbeds, not “AI scientist” benchmarks. Including them in a research-agent portfolio dilutes the AI-for-science narrative.
Robotics curricula. RoboHive, RLBench, LIBERO are well-developed for robot learning but not “AI scientist” benchmarks.
Evolutionary-search reproductions. AlphaEvolve is closed source. OpenEvolve and CodeEvolve reproduce its template — use these if you want AlphaEvolve-style environments. AlphaEvolve’s reported numbers — per Google Cloud’s “AlphaEvolve on Google Cloud” blog, “AlphaEvolve sped up a vital kernel in Gemini’s architecture by 23%, leading to a 1% reduction in Gemini’s training time” — are DeepMind-reported and not independently verified. FunSearch domains (cap-set, online bin packing on OR-Library OR1–OR4 + Weibull 5K, cyclic graphs) are public at github.com/google-deepmind/funsearch but the repository ships code, not a bespoke benchmark suite.
Optimization-flavored. OPT-BENCH (arXiv 2506.10764, Jun 2025) is 20 Kaggle ML tasks + 10 NP-hard combinatorial problems with multi-step interaction — moderate fit. OptiBench is single-shot LP/MILP modeling.
Hardware / systems optimization beyond kernels. No widely-accepted benchmark; KernelBench and GSO are the closest the community currently has.
Materials, chemistry, drug discovery. No public benchmark currently matches all four criteria. ChemBench exists as a knowledge / MCQ eval. The most credible attempts at agentic discovery in these areas remain proprietary (Anthropic’s microbiology results, Isomorphic Labs).
TTT-Discover's Four Domains, Mapped
TTT-Discover 四个领域对照公开 Benchmark
To reproduce TTT-Discover or compare against it, each of its four domains has a corresponding community-recognized benchmark:
| TTT-Discover domain | Public benchmark | Notes |
|---|---|---|
| Erdős minimum overlap / autocorrelation | TTT-Discover’s open sandbox at github.com/test-time-training/discover; AlphaEvolve-style reproductions in OpenEvolve and CodeEvolve | No standalone community benchmark; the problems themselves are recognized math |
| GPUMode TriMul kernel | KernelBench (250 PyTorch tasks across 4 difficulty levels, fast_p metric) | TTT-Discover used GPUMode directly; KernelBench is the community-standard generalization |
| AtCoder Heuristic Contests | ALE-Bench (past AHC tasks, score-based continuous reward, ALE-Agent reference baseline) | TTT-Discover used AHC directly; ALE-Bench is the now-standard wrapper |
| Single-cell RNA denoising | OpenProblems (single-cell benchmarking framework); BixBench (53 bioinformatics scenarios at notebook-level) | OpenProblems is the underlying dataset; BixBench is the agent-task wrapper |
The pattern is consistent: TTT-Discover picked the underlying problem source for each domain (GPUMode, AtCoder, OpenProblems) but did not adopt the benchmark wrappers the community has since standardized on (KernelBench, ALE-Bench, BixBench). A successor system can plug into the wrappers directly.
Two adjacent ecosystems share the same open-ended-plus-multi-step-plus-executable shape but extend in different directions:
- Data-driven scientific discovery. DiscoveryBench (AI2; 264 real data-driven discovery tasks across sociology, biology, humanities, economics, engineering, meta-science), ScienceAgentBench (OSU; 102 tasks from 44 peer-reviewed publications in bioinformatics, comp. chemistry, GIS, cognitive neuroscience), and BixBench push toward analytical workflows rather than just engineering.
- Software optimization. GSO (102 software optimization tasks mined from real commits in NumPy, Pandas, Pillow, llama.cpp across 5 languages) is the closest cousin of KernelBench in spirit. The SWE-bench / SWE-Gym / R2E-Gym family handles repository-level software engineering, though most instances of SWE-bench Verified are closed-ended bug fixes.
What's Still Missing
仍然缺什么
Three notable gaps as of mid-2026:
- No materials / chemistry / drug-discovery benchmark satisfies all four criteria. ChemBench is multiple-choice. The most serious “agentic discovery” efforts in these areas are either proprietary (Anthropic’s microbiology work, Isomorphic Labs) or require wet-lab experiments — currently out of reach for pure-software agents.
- No paper-writing benchmark with closed-loop scoring. ResearchBench (1,386 papers across 12 disciplines) and ReportBench measure surface properties — citation overlap, hypothesis ranking, statement validation — but don’t reward end-to-end execution. AI-Scientist v2 generates papers but evaluation falls back to LLM-as-judge or workshop submission, both noisy and inflated by workshop acceptance rates.
- Game-based environments — NetHack, Crafter, MineDojo, Voyager, BALROG — are open-ended and long-horizon but the community does not treat them as “AI scientist” benchmarks. They sit cleanly on the “embodied / long-horizon planning testbed” branch instead.
The takeaway: TTT-Discover’s four-task hand-pick was effective at proving the method, but it’s one slice of a much wider landscape. The same RL-at-inference loop should plug, with minimal changes, into MLE-bench, RE-Bench, KernelBench, ALE-Bench, MLRC-Bench, and PaperBench Code-Dev. The next paper in this line will likely report on exactly that — and the comparison against frontier models will move with it.