AI for Scientific Discovery: Three Milestones and a Benchmark Map
A Warm-Up: The Cap Set Problem
热身:Cap Set 问题
Before we look at any method, let’s spend a few minutes on one concrete problem all three systems will try to solve — the cap set problem. We’ll work up to the formal statement slowly, starting from a card game you may have played. If you already know the problem from additive combinatorics, you can skim this section.
Start With the Card Game SET
从 SET 卡牌游戏说起
You may know the card game SET. Each card has four attributes — color, shape, number, shading — and each attribute takes one of three values (red/green/purple, oval/squiggle/diamond, 1/2/3, solid/striped/open). So the full deck has \(3 \times 3 \times 3 \times 3 = 81\) cards. A “SET” is three cards such that, for each of the four attributes, the three cards are either all the same or all different. Players race to spot SETs on a shared table.
Hidden inside the game is a natural puzzle:
The no-SET puzzle. Lay out a bunch of cards on the table. What’s the largest number of cards you can put down with no three of them forming a SET?
The answer is 20. You can lay 20 cards from the 81-card deck without any three forming a SET; you cannot lay 21. The constructions that achieve 20 are not obvious — they have specific structure. Our goal in the rest of this warm-up is to turn this innocent-looking card-game question into a mathematical problem you can actually attack.
Encoding the SET Rule as Math
把 SET 规则编码为数学
To reason about the no-SET puzzle we leave the cards behind and work with numbers. Label each attribute value as 0, 1, or 2. (For color, say red = 0, green = 1, purple = 2. The choice is arbitrary; what matters is that each attribute has three labels.) A card now becomes a 4-tuple of digits, e.g., \((0, 2, 1, 1)\) = “red, diamond, 2 shapes, striped”. The 81-card deck is exactly all possible such tuples. Mathematicians write this collection as \(\mathbb{F}_3^4\) — but you can read the symbol as nothing more than “4-tuples whose entries are 0, 1, or 2.”
Now look at when three cards form a SET. Pick any one attribute (one coordinate) and consider the three cards’ values there. By the SET rule, the three values are either all the same or all different. Sum the three values in each case:
- All same, e.g. \(1 + 1 + 1 = 3\). Divisible by 3.
- All different — three values \(0, 1, 2\) in some order — \(0 + 1 + 2 = 3\). Divisible by 3.
- Two same one different (a non-SET case), e.g. \(0 + 0 + 1 = 1\). Not divisible by 3.
So the SET rule, attribute by attribute, is exactly the rule “three values sum to a multiple of 3.” Stacking the four attributes back together:
Three cards \(x, y, z\) form a SET \(\iff x + y + z = (0,0,0,0)\) in \(\mathbb{F}_3^4\), where addition is taken coordinate-wise mod 3.
This is the small but important translation step. The puzzle “what’s the largest no-SET collection of cards?” has become a clean mathematical question:
What’s the largest subset of \(\mathbb{F}_3^4\) in which no three distinct elements sum to zero (mod 3)?
Cap Sets in F3n, and Why They're Hard
F3n 中的 Cap Set,以及它为什么难
There’s nothing magic about four attributes. Replace 4 with any positive integer \(n\) and you get a whole family of problems — the cap set problem:
Given \(\mathbb{F}_3^n\) (the set of \(n\)-tuples with entries in \(\{0, 1, 2\}\)), find the largest subset such that no three distinct elements sum to zero coordinate-wise mod 3. A subset with this property is called a cap set in \(\mathbb{F}_3^n\).
Small cases can be computed (or, for \(n \leq 6\), even proved exactly):
| \(n\) | 1 | 2 | 3 | 4 (SET deck) | 5 | 6 |
|---|---|---|---|---|---|---|
| Max cap set size | 2 | 4 | 9 | 20 | 45 | 112 |
For \(n \geq 7\) the exact answer is unknown; only upper and lower bounds exist. For \(n = 8\) the largest known explicit construction sits at 496 — and improving that by even one element is genuine research. FunSearch produced a construction of size 512 for \(n = 8\), which is exactly such a research-level advance.
Why is the problem so hard? The space \(\mathbb{F}_3^n\) has \(3^n\) elements, so it has \(2^{3^n}\) subsets. For \(n = 8\) that’s \(2^{6561}\), a number with roughly 1975 decimal digits. Brute-force search over subsets is hopeless. You need a construction — a clever recipe that builds a large subset while systematically avoiding any three elements that sum to zero. Such constructions are rare and typically come from deep structural insights: group theory, error-correcting codes, additive number theory.
The slow progress is telling. Between 2002 and 2023, the largest known cap set in \(\mathbb{F}_3^8\) was improved by humans only a handful of times. FunSearch’s 2023 contribution is in this slow line.
To get a visceral sense for how the constraint bites, try the simplest non-trivial case yourself. The widget below is a 3×3 grid of \(\mathbb{F}_3^2\). Click points to add them to your cap set; any three selected points whose coordinates sum to \((0, 0)\) mod 3 will be flagged with a red dashed triangle. The maximum cap set in \(\mathbb{F}_3^2\) has size 4 — try to find one. The surprise is usually how few points you need before some “wrap-around” triple snaps shut.
Why This Is the Right Yardstick
为什么它是合适的标尺
The cap set problem has three properties that make it nearly ideal as a benchmark for any LLM-driven discovery system:
- Mechanically verifiable. A few lines of Python check whether a given subset is a cap set: iterate over triples, check if any sums to zero mod 3. There is no judgment call, no need for a human or another LLM to grade the output.
- No closed form. There is no formula to derive the answer from. Every improvement is genuinely new mathematics, not a rederivation.
- Connected to deep mathematics. Cap-set sizes control the asymptotic exponent \(\omega\) of matrix multiplication via the Cohn–Umans framework. The problem is also the simplest non-trivial instance of “find dense subsets that contain no three-term arithmetic progressions” — a foundational question in additive combinatorics related to Roth’s theorem.
These three properties — easy to verify, no closed form, connected to deeper structure — are exactly what we’ll want from every benchmark in this post. FunSearch picked the cap set problem as one of its two flagship demonstrations because of them. We’ll see analogues in AlphaEvolve (the Erdős minimum-overlap problem) and TTT-Discover (the same Erdős problem, plus the autocorrelation inequality, GPU kernels, AtCoder, biology). The shape of the problem matters as much as the method.
With this concrete problem in mind, we can now look at how FunSearch, AlphaEvolve, and TTT-Discover try to solve it.
FunSearch (2023): Searching the Space of Programs
FunSearch (2023):在程序空间中搜索
The path from “LLMs as conversational assistants” to “LLMs as discovery engines” was opened by FunSearch (Romera-Paredes et al., Nature, Dec 2023). It established the template every subsequent system — including TTT-Discover — has built on: a frozen pretrained LLM proposes candidate programs, a deterministic evaluator scores them, an evolutionary outer loop selects survivors and stuffs them back into the next prompt. Understanding what FunSearch did, and where its design choices start to bind, is the cleanest entry point to the rest of this post.
The Idea: Search Over Programs, Not Outputs
核心思想:在程序而非输出上搜索
The non-obvious move in FunSearch is that the LLM is asked to produce programs that solve the problem, not answers to the problem. For a problem like “find the largest cap set in \(\mathbb{F}_3^n\),” there are at least three plausible search targets:
- The answer directly — a literal set of vectors. Hard to mutate locally, hard to characterize diversity over, and the search space scales with the answer size.
- A solver — a program that, given \(n\), returns a cap set. Compact, interpretable, runs in milliseconds, gives a clean scalar score.
- An informal proof or argument — what GPT-style chain-of-thought defaults to. Hard to verify automatically; needs an LLM-as-judge that itself becomes a noise source.
Why is a program verifiable when chain-of-thought isn’t? Both produce a final answer. The asymmetry is in how you check the answer, not in whether one exists. FunSearch’s deliverable is code: the evaluator runs it and the constructed object (a size-512 cap set, a packing of \(10^6\) items) is generated at evaluation time, then checked mechanically — a single
assertconfirms “is this a cap set?”. Chain-of-thought’s deliverable is prose, which leaves two bad options: either embed the answer literally in the text (a 512-vector list rarely survives an LLM completion intact, and parsing it back out lands you in option 1 with all its scale problems), or argue informally toward it (now verification needs to follow reasoning, which a deterministic checker cannot do — you fall back to another LLM-as-judge or a strict proof checker like Lean, the narrow path AlphaProof takes). Running code is one line of evaluator; reading prose needs parsing, trust, or both.
FunSearch picks (2). Three properties follow:
- Compactness. A 20-line Python function can generate arbitrarily large outputs (e.g., a cap set for any \(n\)). Direct-output search over strings of size proportional to the answer cannot.
- Verifiability. Run the function and compute the score. No subjective judgment, no LLM-as-judge loop, no proof checker required.
- Interpretability. A mathematician can read the resulting Python and understand what construction was discovered — sometimes well enough to generalize it into a paper-quality proof.
The “Fun” in FunSearch is function: searching in the space of functions. That naming choice telegraphs the central design commitment.
Architecture: Frozen LLM + Evolutionary Loop + Sandboxed Evaluator
架构:冻结 LLM + 演化循环 + 沙盒 evaluator
FunSearch’s loop has four components:
- Pretrained LLM. DeepMind used Codey, a PaLM 2 variant fine-tuned for code, frozen throughout. Its only role is to take a prompt of existing programs and write a new one.
- Programs database. An island-based evolutionary population. Each “island” maintains its own pool of high-scoring programs; periodic inter-island migration prevents premature convergence.
- Prompt builder. Selects \(k\) programs from the database (biased toward high scorers), formats them as in-context examples, and asks the LLM to write a better one.
- Sandboxed evaluator. Runs the candidate on the target instance, returns a scalar score. Evaluation is fast (milliseconds) and deterministic.
Crucial constraint: FunSearch evolves one function of at most around 20 lines — the “priority function” or “heuristic” that embodies the central decision the search is optimizing. Everything around it (main loop, I/O, scoring) is fixed by the human designer. This narrow target keeps per-evaluation cost tiny and lets the loop process millions of candidates in one experiment. AlphaEvolve and TTT-Discover both relax this constraint (entire files vs. single functions), and the cost structure changes accordingly.
The frozenness of the LLM is the other binding constraint, and one we will keep coming back to. FunSearch’s pretrained Codey weights are identical at iteration 1 and iteration \(10^6\). Whatever skill the system shows is amortized over all the problems Codey was ever pretrained on, plus the in-context examples assembled at runtime.
Headline Results: Cap Sets and Online Bin Packing
标志性结果:Cap Set 与在线装箱
What's online bin packing? (Click to expand)
Online bin packing: items of various sizes arrive one at a time and must be placed immediately, without knowledge of future arrivals, into bins of fixed capacity. Goal: minimize the total number of bins used. The problem is a classical operations-research staple, modeling settings from VM placement in data centers to truck loading.
Standard heuristics include First-Fit (place in the first bin with room), Best-Fit (place in the tightest bin with room), and First-Fit-Decreasing (offline; sort first). Their competitive ratios are well-studied; improving on them under realistic distributions is non-trivial because the "obvious" tweaks tend to break worst-case guarantees.
FunSearch evolved a small priority function that beats First-Fit and Best-Fit by meaningful margins on standard benchmark distributions, and is readable enough that the underlying "waste-aware" decision rule can be explained in a single sentence.
Two flagship results.
Cap set. For multiple values of \(n\), FunSearch produced larger explicit cap sets in \(\mathbb{F}_3^n\) than any previously known construction. For \(n = 8\) it produced a cap set of size 512. This is a real advance: cap-set lower bounds had been a slow-moving target in additive combinatorics, with progress measured in rare exact constructions, and FunSearch contributed one. The discovered Python function was short enough that mathematicians could read it, simplify it, and study what symmetry it was exploiting.
Online bin packing. FunSearch evolved a priority function that outperforms First-Fit and Best-Fit on multiple benchmark distributions, with a decision rule readable enough to deploy. This was the first demonstration that an LLM-driven search could discover practical algorithmic heuristics whose performance is verified by running on real workloads — not merely solve an artificial puzzle.
The two results matter for different reasons. Cap sets show the system can advance a research-level open problem. Bin packing shows it can produce practical artifacts. Both are made possible by a frozen LLM — there is no learning happening, only sampling from a fixed prior plus selection. The size of that prior and the cleverness of the prompt builder do the work. This is the design point AlphaEvolve and then TTT-Discover would push against in opposite directions.
Why FunSearch Works: Verify Is Easy, Construct Is Hard
FunSearch 为什么能成功:验证容易,构造极难
Two complementary properties make cap set the right problem for an LLM-evolutionary loop. Together they let a noisy proposer beat decades of careful human work.
Verification is cheap and exact. Given any subset \(S \subseteq \mathbb{F}_3^n\), checking whether \(S\) is a cap set takes a few lines of Python:
import itertools
def is_cap_set(S, n):
"""Return True iff no 3 elements of S sum to (0,...,0) mod 3."""
for a, b, c in itertools.combinations(S, 3):
if all((a[k] + b[k] + c[k]) % 3 == 0 for k in range(n)):
return False
return True
That’s it. No subjective judgment, no LLM-as-judge, no learned verifier. The check runs in \(O(\vert S\vert^3 \cdot n)\) time — polynomial in the size of the answer, regardless of how clever the construction was. The evaluator runs this in milliseconds on a candidate and emits a clean numerical score (typically: \(\vert S \vert\) if valid, else 0).
Construction is exponentially hard. \(\mathbb{F}_3^n\) has \(3^n\) elements and \(2^{3^n}\) subsets. For \(n = 8\) that is \(2^{6561}\) — enumeration is hopeless. Worse, naive greedy construction (“add the next compatible element”) plateaus quickly: the largest cap sets require non-local choices, where committing to a slightly worse early element opens up a much larger valid extension later. Humans have produced cap-set lower bounds slowly over decades by inventing constructions from algebraic geometry, group-theoretic symmetries, and error-correcting codes — each requiring years of domain expertise to find a single new construction worth publishing.
This is the canonical NP-style asymmetry: a valid construction is poly-time checkable, but finding one (let alone an optimal one) is exponential in the worst case. Most interesting scientific-discovery problems share this shape — theorems are easy to verify in Lean, programs are easy to run, kernels are easy to benchmark, but inventing the artifact is hard.
What FunSearch outsources. Crucially, FunSearch does not enumerate cap sets or try to “understand” them mathematically. It evolves a small priority(el) function that scores individual elements of \(\mathbb{F}_3^n\). A fixed greedy driver then builds the cap set by taking elements in priority order, rejecting any that would form a line with two already-chosen elements:
import itertools
# === The ~20-line function FunSearch EVOLVES. ===
# The published priority function for n=8 sums several modular-arithmetic
# terms; the form below sketches its structure. See Romera-Paredes et al.
# 2023, Extended Data Fig. 3, for the exact code.
def priority(el: tuple[int, ...], n: int) -> float:
score = 0.0
for i in range(n):
score += LINEAR_COEFF[i][el[i]] # per-coordinate
for i in range(n):
for j in range(i + 1, n):
score += PAIR_COEFF[i][j][el[i]][el[j]] # pairwise
# ... a few more terms; the coefficients embed the discovered structure
return score
# === The FIXED driver. FunSearch does NOT evolve this. ===
def build_cap_set(n: int) -> list[tuple[int, ...]]:
elements = list(itertools.product(range(3), repeat=n))
elements.sort(key=lambda e: -priority(e, n)) # highest first
chosen: list[tuple[int, ...]] = []
for e in elements:
# Reject if e completes a line with any pair already chosen.
forms_line = any(
all((a[k] + b[k] + e[k]) % 3 == 0 for k in range(n))
for a in chosen for b in chosen if a < b
)
if not forms_line:
chosen.append(e)
return chosen
All the intellectual content lives in priority. The LLM proposes candidate priority functions; the evaluator runs build_cap_set and reports len(chosen); the evolutionary loop keeps the high-scoring ones and asks the LLM for variations. Over millions of evaluations, the search converges on priority functions whose induced greedy ordering produces unusually large cap sets — for \(n=8\), FunSearch’s discovered priority function produces a cap set of size 512.
Why this division of labor matters. Humans hit a cognitive wall when asked “how should I score an element to bias the greedy toward large constructions?” The question has millions of plausible answers, no closed form for the best one, and no obvious symmetry to exploit. A mathematician trying to write the priority function from scratch would think in terms of group orbits, linear codes, or quadratic forms — but the best priority function FunSearch finds may not match any of those: it’s a list of coefficients tuned through millions of greedy runs, each scored objectively. The LLM does not think mathematically; it samples short Python functions, each one a guess. Most are bad. A few are surprisingly good. The evaluator separates them cheaply.
The core asymmetry. Cheap to test, intractable to design. This is what lets a noisy LLM proposer plus a deterministic verifier discover what no human could write directly. AlphaEvolve and TTT-Discover both inherit this same asymmetry — on bigger problems (full code files, GPU kernels, biology pipelines) and with bigger budgets, but the underlying logic is identical.
AlphaEvolve (2025): Scaling the Template
AlphaEvolve (2025):把模板扩展上去
AlphaEvolve (DeepMind, 2025) inherits FunSearch’s shape and scales every axis of it: bigger models, longer programs, richer evaluator feedback, and the operational discipline to run all of this against problems that are economically meaningful at Google’s scale. The combination unlocks results that FunSearch’s constraints made out of reach.
What Changed: Functions to Files, One Model to Two
变了什么:函数到文件、单模型到双模型
Three architectural choices distinguish AlphaEvolve from FunSearch:
Two-tier LLM ensemble. A single run uses both Gemini 2.0 Flash (fast, cheap, generates the bulk of proposals) and Gemini 2.5 Pro (slower, smarter, used occasionally to seed harder candidates). Pure 2.5 Pro would be too expensive to run at evolutionary throughput; pure 2.0 Flash misses the deeper moves. The ensemble gets most of the throughput at Flash cost while keeping Pro’s creativity available where it counts.
Evolution at code-file granularity. Where FunSearch evolved single ≤20-line Python functions, AlphaEvolve evolves entire code files of hundreds of lines in any language. Candidates can carry non-trivial state, helper functions, even small frameworks. The trade-off is that each evaluation may take hours rather than milliseconds — but for high-value problems (kernel speedups, scheduler heuristics), that’s still worth it.
Parallel evaluators with rich feedback. Each candidate is dispatched to a parallel evaluator pool. The evaluator returns not just a scalar score but per-objective signals the next prompt can reference. The outer loop is still island-based evolution inherited from FunSearch, but populations and migration patterns are scaled up.
The cost asymmetry vs. FunSearch is the obvious downside: AlphaEvolve’s per-evaluation cost is orders of magnitude higher, so total candidates explored is orders of magnitude lower. The bet is that the richer search space (full files) and richer feedback (per-objective signals) more than compensate.
Benchmark Results: Math and Infrastructure
基准结果:数学与基础设施
What does this machinery actually produce? The headline results across math and infrastructure:
| Domain | Problem | Prior best | AlphaEvolve | Direction |
|---|---|---|---|---|
| Algorithm discovery | 4×4 complex matrix mult | 49 mults (Strassen 1969) | 48 mults | Fewer is better |
| Extremal combinatorics | Erdős minimum-overlap | 0.380927 (2016) | 0.380924 | Lower is better |
| Discrete geometry | Kissing number / packing | Various human records | Improved in several settings | Better bound |
| ML systems | Gemini training kernel | Hand-tuned baseline | ~23% speedup | Faster is better |
| Cluster scheduling | Borg scheduling heuristic | Existing production | ~0.7% compute recovered | Higher recovery |
| TPU hardware | Verilog circuit | Existing design | Reduced gate count | Fewer gates better |
What's Strassen's algorithm, and why is 48 mults a big deal? (Click to expand)
Volker Strassen showed in 1969 that the product of two 2×2 matrices can be computed with just 7 scalar multiplications (and more additions), not the obvious 8. Recursing this insight via block decomposition on n×n matrices gives runtime O(nlog2 7) ≈ O(n2.807) — the first algorithm beating the cubic O(n3) baseline. Applied once more to 4×4 = (2×2)×(2×2) the recursion gives 7×7 = 49 scalar multiplications.
That 49-mult count for 4×4 over the complex numbers stood unchallenged for 56 years, until AlphaEvolve found 48 in 2025. Earlier in 2022, AlphaTensor had already found a 47-mult algorithm for the 4×4 case — but only over the field of two elements ℤ2, not over the reals or complex numbers, so it does not apply to standard linear algebra.
The math results matter as proofs of capability: extremal problems with crisp, verifiable answers, on which AlphaEvolve unambiguously beat the human-mathematician state of the art (e.g., a 56-year-old Strassen record). The production results matter as proofs of value: deployed at Google scale, returning real compute and silicon savings. Both create non-trivial benchmarks for any successor system to target — including TTT-Discover, which directly reuses the Erdős minimum-overlap problem to claim a head-to-head win.
The Frozen-Weight Ceiling
冻结权重的天花板
Through AlphaEvolve’s entire pipeline, the LLM itself does not learn. Its weights at problem 1000 are bitwise identical to its weights at problem 1, just as FunSearch’s Codey weights were. All problem-specific behavior comes from in-context conditioning: the prompt for proposal \(N+1\) contains carefully-curated examples from \(N\). This is enormously powerful — frontier LLMs have huge implicit priors that the evolutionary loop can amplify — but it is also a ceiling.
Once the context budget is exhausted, the model cannot get better at this problem; the only remaining knob is running the loop longer and hoping a sample lands. Empirically, AlphaEvolve runs show diminishing returns after some number of generations on hard problems: the best score stops improving and the wall-clock cost per marginal advance balloons. The remedy that AlphaEvolve does not attempt is to update the model. TTT-Discover’s bet is precisely that moving the LLM’s weights themselves — modestly, RL-style, on the test problem — clears that ceiling at acceptable cost.
A Second Warm-Up: GPU Kernel Design
第二次热身:GPU Kernel 设计
TTT-Discover’s most operationally impressive results are on GPU kernels — beating hand-tuned human champions on four different chips. To appreciate why that’s a hard problem (and what TTT-Discover’s RL loop is actually searching over), we need a short tour of how GPU kernels become fast. The same warm-up-then-method pattern as the cap-set section: this part introduces the problem; the TTT-Discover section that follows shows the result.
What Is a GPU Kernel, and Why Is Making It Fast Hard?
GPU Kernel 是什么?为什么调它这么难?
A GPU kernel is a small program that runs in parallel across many threads on a GPU. A modern NVIDIA H100 has 132 streaming multiprocessors (SMs), each running thousands of threads concurrently — so a single kernel launch can have on the order of \(10^6\) thread instances in flight at once. The GPU schedules them across SMs automatically.
For matrix multiplication \(C = A \cdot B\) with \(C\) of size \(M \times N\), the simplest possible kernel — the naive kernel — assigns one thread per output element. Thread \((i, j)\) reads row \(i\) of \(A\), column \(j\) of \(B\), accumulates a \(K\)-long dot product, and writes \(C_{ij}\). Correctness is trivial; performance is terrible. Modern GPUs reach their headline TFLOPS only when the compute units are fed fast enough — and reading from HBM (high-bandwidth memory) is much slower per byte than computing on already-loaded data. The naive kernel re-reads each element of \(A\) and \(B\) many times, the SMs spend most of their cycles waiting on memory, and the achieved throughput is typically under 5% of peak.
The entire subfield of GPU kernel optimization is about closing this gap.
The Fundamental Tension: Compute vs Memory
基本张力:算力 vs 内存
Every kernel sits somewhere on a roofline:
- Memory-bound regime. Throughput is limited by how fast data arrives from HBM. To go faster, reduce bytes-moved per useful op.
- Compute-bound regime. Throughput is limited by how many FLOPs the SMs can do per cycle. To go faster, increase parallelism or use higher-throughput math units (tensor cores).
The dividing line is the kernel’s arithmetic intensity: FLOPs done per byte loaded. The crossover sits at \((\text{peak FLOPS}) \,/\, (\text{peak bandwidth})\). For an H100 with ~50 TFLOPS fp32 and ~3 TB/s memory bandwidth, that’s about 17 FLOPs/byte. Below 17, any further compute optimization is wasted; above 17, any further bandwidth optimization is wasted. Every well-tuned kernel pushes its arithmetic intensity firmly above the crossover. How depends on the operation. For matmul, the dominant tool is tiling.
Tile Size: The Key Knob
Tile Size:关键旋钮
Instead of one output element per thread, organize threads into blocks that cooperatively compute a \(T \times T\) tile of \(C\). The block first loads a strip of \(A\) (T rows, K columns) and a strip of \(B\) (K rows, T columns) into shared memory — on-chip SRAM, roughly 100× faster than HBM. The threads in the block then compute the entire \(T \times T\) tile from shared memory only, reusing each loaded element \(T\) times.
For each \(T \times T\) output tile:
- HBM bytes read: \(4 \cdot 2 T K = 8 T K\) (fp32)
- FLOPs done: \(2 T^2 K\) (one multiply + one add per \(T \times T \times K\) outer-product)
- Arithmetic intensity: \(T / 4\) FLOPs/byte
So doubling the tile size doubles the arithmetic intensity. Going from \(T = 1\) (the naive kernel) to \(T = 64\) takes you from 0.25 FLOPs/byte to 16 FLOPs/byte — right at the H100 crossover.
But \(T\) cannot grow without bound. Shared memory per block is only 48–228 KB, and each block needs \(4(T^2 + 2TK)\) bytes for the output tile and input strips. The register file and the total threads-per-block both impose further ceilings. If blocks become too big, you launch too few of them in parallel and lose occupancy — SMs sit idle for lack of work to schedule. So choosing \(T\) is a balancing act: large enough to be compute-bound, small enough to fit and to keep the SMs busy.
Try it yourself. The widget below models a simple square-tile matmul against representative GPU parameters (50 TFLOPS compute, 3 TB/s bandwidth, 48 KB shared memory per block). Slide \(T\) from 1 (the naive kernel) up through 128 (oversized) and watch the user point trace along the roofline.
The interesting band is roughly \(T \in [64, 96]\): large enough to be compute-bound, small enough to fit. Outside that band you’re leaving performance on the table. Real production kernels go much further — multi-level tiling (warp / thread / block), software pipelining, tensor-core packing, hand-tuned warp specialization, branch reduction, instruction-level scheduling — but the tile-size tradeoff is the first lever any optimizing kernel pulls.
Why TriMul Is Special (and What TTT-Discover Optimizes)
为什么 TriMul 特殊(以及 TTT-Discover 在优化什么)
The TriMul kernel is matmul with a twist: one input is restricted to upper- or lower-triangular, so half its entries are known to be zero. This pattern is ubiquitous in transformer training — the causal attention mask makes \(Q K^\top\) effectively triangular — so a fast TriMul translates directly into pretraining speedups for every causal-LM stack.
A naive kernel doesn’t know about the zeros. It loads them, multiplies by them, adds them, and wastes half the FLOPs. A specialized kernel must:
- Skip dead loads. Don’t fetch elements known to be zero — recover the lost bandwidth.
- Compress storage. Pack the triangular matrix into half the space — doubles effective bandwidth again.
- Branch-align tiles. Pick tile dimensions so each tile lies entirely above, below, or on the diagonal — never straddling, which forces warps to branch (all threads in a warp must take the same control path, so divergent branches serialize).
- Pipeline across the triangle. Reorder block iteration so warp issue rates stay high in the “thin” rows/columns near the diagonal.
These optimizations interact. The tile size that maximizes arithmetic intensity may put tile boundaries on the wrong side of the diagonal. The compressed layout that saves bandwidth may force ugly strided access patterns. Picking the best combination is what a competitive GPU engineer does over weeks of profiling.
This is the design space TTT-Discover walks. The system inherits an LLM’s prior knowledge of CUDA/Triton idioms, and uses RL to specialize a kernel for a specific TriMul shape on a specific chip. Each chip generation has different tradeoffs — H100 has tensor cores and 228 KB shared memory per block, A100 has 164 KB, AMD MI300x has wave-level instead of warp-level execution, B200 has yet another set of changes — so the optimal kernel for each chip is a different small program. The headline result you’ll see in the TTT-Discover section — beating GPUMode champions on four chip generations with one recipe — is impressive precisely because the same RL loop found a different optimum for each chip, with no human re-tuning.
TTT-Discover (2026): Training at Inference Time
TTT-Discover (2026):推理时训练
TTT-Discover keeps everything AlphaEvolve does outside the LLM — the evaluator, the prompt-building, the buffer of past programs — and changes one thing inside it: at every step, the LLM’s weights are updated by reinforcement learning, using the evaluator’s scores as reward. The rest of this section walks through what that one change buys, how it’s implemented, and how it performs on four very different problem domains.
Test-Time Training in One Paragraph
一段话讲完 Test-Time Training
Test-time training (TTT), introduced by Sun, Wang, Liu, Miller, Efros, Hardt at ICML 2020, takes a single test instance and runs a few gradient steps of a self-supervised loss on that instance before predicting. The thesis: each test instance defines its own learning problem. In 2020 the auxiliary loss was image-rotation prediction. In the 2024 sequence-modeling revival (TTT-Linear / TTT-MLP), it became an RNN inner state whose recurrence is itself a gradient step. TTT-Discover is the third act of the same idea: the “test instance” is now a scientific problem, the “auxiliary loss” is now an RL reward computed by the evaluator, and the “few gradient steps” become hundreds of RL updates on a single problem. The mechanism — train at test time, not just predict — is unchanged.
The single distinction from AlphaEvolve. AlphaEvolve: frozen LLM + evolutionary outer loop. TTT-Discover: the LLM’s weights are updated by RL using the evaluator’s scores as reward; the next batch of proposals is drawn from the updated model. Everything else is shared.
The Core Loop
核心循环
For each problem \(P\) paired with an evaluator \(E\), TTT-Discover instantiates a fresh copy of a pretrained policy \(\pi_{\theta_0}\) (the paper uses gpt-oss-120b, OpenAI’s open-source 120B-parameter model) and iterates:
init: policy π_θ = gpt-oss-120b
problem P with evaluator E
rollout buffer B = ∅
for step i = 0, 1, 2, ..., N:
# 1. PROPOSE: sample candidates from current policy
candidates = π_θ.sample(P, batch_size=k)
# 2. SCORE: deterministic evaluator
rewards = E(candidates)
B.extend(zip(candidates, rewards))
# 3. TRAIN: RL update on the rollouts
θ ← rl_update(θ, B) # ← the TTT step
# 4. KEEP BEST so far
best = max(best, max(rewards))
return best
Single-turn rollouts, multi-step training. Each call to
π_θ.sample(...)produces a single-turn generation — the model receives a prompt (problem spec plus a few high-scoring past candidates as in-context examples) and emits one complete artifact in one shot: a Python program (math, AtCoder, biology) or a CUDA/Triton kernel (GPUMode). It does not see compiler errors, benchmark timings, or RNA-seq scores within a single rollout. The reward is computed once on the finished artifact and becomes the credit signal for the policy-gradient update on that whole output sequence. The “iteration” happens at the RL-step level — the next batch is drawn from a different policy that has absorbed the gradient signal from this batch — so the model gets progressively better at writing correct kernels first try across training steps, not across turns within one rollout. This contrasts with agentic systems like SWE-agent or Sakana’s AI CUDA Engineer, which keep the LLM frozen and instead wrap it in a multi-turn scaffold (write → compile → read error → fix → re-run → …). TTT-Discover’s bet: in domains with a fast deterministic verifier, updating the weights buys more than giving the model more turns at the same compute.
The substantive line is step 3. AlphaEvolve has steps 1, 2, 4 — and replaces step 3 with “stuff top-scoring candidates into the next prompt as in-context examples.” TTT-Discover instead does an actual RL gradient update, so that the distribution from which step 1 samples shifts toward higher-reward regions of program space as the search proceeds.
The empirical signature is visible in the reward distribution per step. The figure below, from the paper’s project page, plots the reward distribution at steps 0, 9, 24, and 49 on the GPUMode TriMul kernel competition. The whole distribution drifts upward over the course of the run — not just the maximum, but the median, the tails, the modes. The model is becoming progressively more skilled at this one problem.
(For background on GPUMode and the TriMul kernel, see the GPU kernel warm-up above.)
Reward Design, Training Recipe, and Cost
奖励设计、训练配方、成本
The 2020 TTT paper used a self-supervised auxiliary loss because vision OOD has no labels at test time. TTT-Discover’s domain is different: every problem comes with an evaluator \(E: \mathrm{candidate} \to \mathbb{R}\) — did the program compile and run? did the proof check in Lean? does the kernel produce correct output, and how fast? Those scores are labels — imperfect (high score doesn’t imply optimum), but enough to drive RL.
Two design consequences follow.
The evaluator is the bottleneck. The method’s leverage comes entirely from how informative and inexpensive \(E\) is. Each step of TTT-Discover requires evaluating a batch of candidates; the inner loop is bounded by evaluator throughput. Math conjectures get this for free (closed-form scoring); GPU kernels need a correctness checker plus a timed benchmark; competitive programming needs hidden test cases; biology needs a held-out reference. The four case studies were chosen specifically because each has a fast, reliable, densely-informative scorer.
The base model has to be capable. A weak model produces samples whose average reward is too low for the gradient to find signal. gpt-oss-120b is large enough to occasionally hit high-reward candidates from cold-start prompting — RL then amplifies that occasional hit into a reliable mode of behavior. Trying TTT-Discover on top of a 7B model would likely fail not because the method is wrong but because RL needs something to reinforce.
Practical details that matter for reproducibility:
- Backbone: gpt-oss-120b, open-source. One fresh fine-tunable copy per problem.
- Optimizer & RL objective: standard policy-gradient setup with the rollout buffer as on-policy data; details follow recent open-source RL-for-LLM recipes.
- Steps per problem: on the order of 50 RL updates (the figure shows steps 0, 9, 24, 49); each step processes a batch of new rollouts.
- Sampling at each step: thousands of candidates per step in the engineering domains; fewer when each evaluation is expensive (biology).
- Wall-clock per problem: hours to a day on commodity GPU clusters.
- Dollar cost per problem: “a few hundred dollars” per the paper — an order of magnitude below the implied AlphaEvolve per-problem cost.
The cost asymmetry comes from two places. First, gpt-oss-120b on rented GPUs is much cheaper per token than frontier-API access to a model like Gemini 2.5 Pro. Second, RL is sample-efficient compared to evolutionary search — a gradient step uses every sample in the buffer; an evolutionary step uses only the survivors. The same compute spend extracts more signal.
Math Results: Erdős & Autocorrelation
数学结果:Erdős 与自相关
Who was Erdős, and what's the minimum overlap problem? (Click to expand)
Paul Erdős (1913–1996) was a Hungarian mathematician famous for his extraordinarily prolific output — over 1500 published papers with 500+ collaborators — and for the habit of posing open problems with small cash prizes attached to their solutions. The minimum overlap problem is one of his characteristic combinatorial questions, posed in 1955.
Informally: partition the integers from 1 to 2n into two equal-sized sets A and B. Now slide one of them — consider B shifted by every integer k. The two sets B+k and A will overlap in some number of points. The minimum overlap problem asks: how small can you make the largest such overlap, by choosing A and B cleverly? Formally, define M(n) = min over partitions of max over shifts k of |A ∩ (B + k)|. Erdős conjectured that the limiting density c = limn→∞ M(n)/n is strictly bounded away from 0 and from 1/4.
The problem is hard because it's extremal: even the most evenly-spread partitions have a "bad shift" that pushes the overlap up. No closed form for c is known; every published bound corresponds to an explicit numerical construction. Improving the bound is a long-running goalpost in additive combinatorics.
For a partition of \(\{1, \dots, 2n\}\) into two equal-sized sets \(A\) and \(B\), define \(M(n) = \min_A \max_k \vert A \cap (B + k) \vert\) — the smallest possible peak overlap as you slide \(B\) over \(A\). The Erdős minimum overlap conjecture asks for the value of \(c = \lim_n M(n)/n\). Improving either bound on \(c\) has been an open problem for decades; AlphaEvolve made the first numerical progress since 2016, and TTT-Discover advances it further.
| Method | Year | Best bound on \(c\) |
|---|---|---|
| Last published human result | 2016 | 0.380927 |
| AlphaEvolve | 2025 | 0.380924 |
| TTT-Discover | 2026 | 0.380876 |
The TTT-Discover number drops the bound by about \(5 \times 10^{-5}\) below AlphaEvolve — small in absolute terms but the kind of move that takes a competition-grade specialist a year of work. Both numbers correspond to explicit constructions (partitions of \(\{1, \dots, 2n\}\) found by the search), so the result is verifiable.
The autocorrelation inequality is a sibling problem: extremal autocorrelations of indicator functions of subsets of integers. AlphaEvolve had previously pushed it; TTT-Discover takes the same problem and improves further (concrete numbers are reported alongside the Erdős result in the paper’s tables). The methodological point of the math results is not that TTT-Discover solves either conjecture — neither is solved — but that on shared problems and shared compute budgets, the test-time-training lever beats the evolutionary-search lever. That comparison is cleaner than comparing a TTT system against a different method on different problems.
Engineering: GPU Kernels and AtCoder
工程:GPU Kernel 与 AtCoder
The most operationally impressive result is on the GPUMode TriMul kernel competition: write a CUDA/Triton kernel that computes a particular triangular matrix multiplication, matching reference correctness and beating the human leaderboard on wall-clock time. The reward signal is straightforward: correctness gate × time speedup. The same TTT-Discover run was repeated on four different accelerator generations:
| Hardware | Best human (μs) | TTT-Discover (μs) | Speedup |
|---|---|---|---|
| NVIDIA A100 | 4,531 | 2,198 | 2.06× |
| AMD MI300x | 2,462 | 1,596 | 1.54× |
| NVIDIA H100 | 1,371 | 1,161 | 1.18× |
| NVIDIA B200 | 1,005 | 905 | 1.11× |
Several things to read out of this table. First, the absolute speedup is larger on older / slower chips (A100, MI300x) and shrinks on newer ones (H100, B200) — the older chips leave more room for optimization that human contestants didn’t fully exhaust, while newer chips’ theoretical peak is closer to the human submissions’ wall-clock. Second, the cross-hardware consistency — every chip beaten, including AMD — suggests the method isn’t overfitting to one vendor’s quirks. Third, the human “best” entries are the result of organized public-competition work, not casual baselines, which makes the comparison meaningful.
What's AtCoder, and how does a Heuristic Contest differ from an algorithm contest? (Click to expand)
AtCoder is a Japanese competitive programming platform, comparable in scale and prestige to Codeforces. It runs two main contest formats with very different rules.
Algorithm contests are short sessions (2–5 hours) with problems graded pass / fail — your solution either matches the expected output on every hidden test case or it does not. They reward correctness and asymptotic efficiency.
Heuristic contests (AHC) are different. They run for one to two weeks, problems are NP-hard or otherwise admit no exact polynomial-time solution, and your score is a continuous value — e.g., total cost of a schedule, number of items packed, area covered. Top submissions are the product of dozens of incremental refinements over the contest window by an experienced competitive programmer. Beating the leader by even a small margin is therefore non-trivial: it requires inventing or implementing a heuristic that the field's best didn't try.
AtCoder Heuristic Contests (AHC) are long-form competitive programming problems where the score is continuous (not pass/fail) and the goal is to maximize it — closer in flavor to “construct a good solution” than to “find the algorithm.” TTT-Discover was run on past AHC contests with the contest’s own scoring used as reward.
| Contest | Domain | Best human score | TTT-Discover score |
|---|---|---|---|
| AHC39 | Geometry | 566,997 | 567,062 |
| AHC58 | Scheduling | 847,674,723 | 848,414,228 |
The margins are small but above the best human, which on contests of this scale means the top of several hundred competitive programmers, often with multiple submissions per contestant. Same method, same model, two very different domains.
Biology: Single-Cell RNA Denoising
生物学:单细胞 RNA 去噪
Single-cell RNA-seq, PBMC, Tabula, and the classical denoisers. (Click to expand)
Single-cell RNA sequencing (scRNA-seq) measures mRNA abundance for every gene in every individual cell. The output is a (cells × genes) count matrix where entry (i, j) is approximately the number of mRNA molecules from gene j detected in cell i. Typical experiments produce 10,000 to 100,000 cells by ~20,000 genes. The matrix is brutally sparse — each cell only has a few thousand non-zero entries — and is contaminated by technical artifacts (uneven capture efficiency, amplification bias, ambient mRNA from lysed cells).
Downstream analyses (cell-type clustering, trajectory inference, marker-gene discovery) are sensitive to that noise, so denoising is a critical preprocessing step. The widely-cited classical denoisers are MAGIC (Markov affinity-based graph imputation), DCA (deep count autoencoder), and scVI (single-cell variational inference); each trains a dedicated model on the dataset.
PBMC = peripheral blood mononuclear cells, a benchmark dataset of immune cells from blood. Tabula refers to the Tabula Sapiens / Tabula Muris cell atlases — large multi-organ scRNA-seq references widely used to evaluate denoising and integration methods.
The biology case study pushes the method farthest from its comfort zone. Single-cell RNA sequencing produces sparse, noisy gene-expression counts; classical denoising methods (MAGIC, DCA, scVI) require training a domain-specific model on the dataset. TTT-Discover instead asks gpt-oss-120b to write a denoising program — a Python script that takes the raw count matrix and emits a denoised one — and uses Pearson correlation against a held-out reference as the reward signal.
| Dataset | Best human method | TTT-Discover |
|---|---|---|
| PBMC | 0.64 | 0.71 |
| Tabula | 0.64 | 0.73 |
Two-to-nine points of Pearson correlation is a meaningful margin on these benchmarks. The interesting design choice is what TTT-Discover outputs: a program, not a denoised matrix. The model is trained to be good at writing denoisers for this dataset, not at denoising itself. That inheritance from FunSearch and AlphaEvolve — programs as the unit of discovery — keeps the artifact interpretable: a biologist can read the generated Python and slot it into a real pipeline. A model that outputted a matrix directly would be hard to trust the same way.
Where This Sits, and What Comes Next
这工作的位置与下一步
Place the three milestones on a 2×2:
| Single-domain (bespoke) | Open-domain (LLM-based) | |
|---|---|---|
| Frozen at inference | Pre-2020 supervised ML | FunSearch (2023), AlphaEvolve (2025), o1/R1-style scaling |
| Updated at inference | AlphaProof’s contest-time RL (2024, narrow) | TTT-Discover (2026) |
(For reference: AlphaProof is DeepMind’s Lean-based theorem-proving system that earned a silver-medal score at IMO 2024 by running its RL training loop during the contest itself — the one prior public example of inference-time weight updates for scientific discovery, but only in the narrow Lean-proof domain.)
TTT-Discover claims the previously empty bottom-right cell: a general-purpose discovery system whose LLM is trained on whatever single problem you point it at, using off-the-shelf open weights as the cold start. The math, engineering, and biology results are the proof that this quadrant is not just theoretically interesting but currently the strongest publicly-reported configuration on the problems where direct comparisons exist.
Four threads worth flagging, two practical and two conceptual.
Practical. Does cost stay flat as problems get harder? The reported few-hundred-dollar figure is presumably averaged over problems where RL did successfully climb the reward landscape. Frontier-difficulty problems may cost much more, or fail to converge at all. Can the trained per-problem weights be reused? The paper treats each problem in isolation by design, but the specialized weights are a real asset — a “fine-tuned for kernels” model could plausibly cold-start later kernel runs faster.
Conceptual. Where does the prior come from? TTT-Discover’s RL does not invent new mathematical machinery; it amplifies structural priors already inside gpt-oss-120b’s pretraining. As problems drift further from anything in the pretraining mix, the leverage RL provides over the base model presumably shrinks. Is “test-time training” still the right framing? When you train for hundreds of GPU-hours on a single problem, the line between “test-time” and “training-time” has blurred almost beyond recognition. The deeper claim is that the unit of training has shifted from a corpus to a problem — a thesis Yu Sun’s group has been building toward since 2020, and the strongest test of which is whether the next paper extends the trajectory.
Beyond the Four Domains: The Wider Benchmark Landscape
超越四个领域:更广的 Benchmark 全景
TTT-Discover targets four problems by design: an Erdős extremal-combinatorics conjecture, the GPUMode TriMul kernel competition, AtCoder Heuristic Contests, and single-cell RNA denoising on OpenProblems. Each was chosen for a fast deterministic verifier and a meaningful human baseline. Over the last 18 months, the LLM-agent community has built out a much larger benchmark ecosystem with the same shape — and any successor work in this line will be measured against the ecosystem rather than the hand-picked four. This section sketches the map.
A Four-Criteria Filter for "AI Scientist" Benchmarks
四标准过滤器:什么算"AI Scientist" Benchmark
Across the recent literature, benchmarks credibly framed as “AI scientist” or “research agent” evaluations tend to satisfy four properties:
- Open-ended. The agent picks among many viable solution paths; there is no single canonical answer to copy from training data.
- Ideation matters. Performance is bounded by what non-obvious approach the agent tries, not just execution speed. Hyperparameter tuning alone shouldn’t win.
- Multi-step environment. The agent interacts with a runnable artifact — file system, compiler, simulator, dataset. Reward comes from program execution, not string matching.
- Non-trivial rollout horizon. Each evaluation takes minutes to hours, which is where iterative search and learned approximate scorers start to pay off.
Several widely-cited benchmarks fail at least one criterion and are therefore measuring something different. SWE-bench Verified problems mostly have a single hidden correct patch (criterion 1 fails for most instances). FrontierMath and PutnamBench score in milliseconds against a single numerical or symbolic answer (criterion 4 fails). BrowseComp asks for a single short string (criterion 3 fails). These are not bad benchmarks — they just measure other things (closed-ended program repair, math-answer recall, web retrieval).
The Stratified Stack
分层栈
The benchmarks that do pass the filter organize, somewhat naturally, into tiers of increasing research demand:
| Tier | What the agent does | Representative benchmark | Year |
|---|---|---|---|
| ML engineering | Win a Kaggle competition | MLE-bench (75 Kaggle competitions, OpenAI) | 2024 |
| Research engineering | Beat human ML-RE workflows | RE-Bench (7 tasks, 71 human-expert attempts, METR) | 2024 |
| Methodological innovation | Beat baselines via new methods | MLRC-Bench (7 recent ML-conference contests) | 2025 |
| Long-horizon optimization | Hill-climb a scoring function for days | ALE-Bench (AtCoder Heuristic Contests, Sakana) | 2025 |
| GPU kernel optimization | Write a CUDA / Triton kernel that beats reference | KernelBench (Stanford, 250 PyTorch tasks) | 2024 |
| Full paper reproduction | Build a paper end-to-end | PaperBench (20 ICML 2024 papers, OpenAI) | 2025 |
| End-to-end paper generation | Pick topic, execute, write | AI-Scientist v2 (Sakana, open-ended) | 2025 |
Each tier raises the bar from “given a goal, can you reach it?” toward “can you set your own goal and reach it?” Current frontier LLMs perform passably at the engineering end and progressively worse moving up — o1-preview + AIDE reaches Kaggle bronze in 16.9% of MLE-bench competitions; the best agent on MLRC-Bench closes only 9.3% of the gap between baseline and top human. Performance has not yet saturated at any tier.
A second family sits one level deeper, more useful for training agents than for headline leaderboards: MLAgentBench (Stanford / Snap, 13 ML tasks), MLE-Dojo (a gym-style trainable extension of the MLE-bench distribution with explicit SFT/RL support), MLGym (Meta + NYU, 13 open-ended tasks across CV / NLP / RL / game theory), and DSBench (466 data analysis + 74 data modeling tasks from ModelOff + Kaggle).
The figure below plots ~40 of these benchmarks on a 2D map — domain on the x-axis, research intensity on the y-axis. Each marker is a benchmark; color encodes year. Hover for a one-line description; click to open the paper or repo.
ML & Research Engineering — Deep Dive
ML 与研究工程——细读
The most populated tier — every team has reproducible ML workflows to evaluate, so this is where benchmarking converged first.
-
MLE-bench (OpenAI, arXiv 2410.07095, Oct 2024) — 75 Kaggle competitions across vision, NLP, tabular, signal, biology. Agents train models, prepare datasets, submit predictions; graded against Kaggle’s actual leaderboard medal thresholds. Best published result: o1-preview + AIDE scaffolding reaches Kaggle bronze in 16.9% of competitions. Rollouts span hours to dozens of hours per competition. Note: OpenAI paused new leaderboard submissions on 24-Apr-2026 pending a v2 fairness overhaul; published o1-preview / AIDE numbers and the AIRA Toledo et al. follow-ups remain the comparable anchors. Code: github.com/openai/mle-bench.
-
RE-Bench (METR, arXiv 2411.15114, Nov 2024, ICML 2025) — 7 ML research-engineering environments: scaling-law fitting, GPU kernel optimization, LLM training task search, restricted MLP architecture search, etc. Designed with feedback from frontier-lab research engineers. 71 8-hour attempts by 61 distinct human experts open-sourced; agent transcripts at transcripts.metr.org. One AI agent wrote a faster custom Triton kernel than any of the human experts. 2h / 8h / 32h time-budget protocol. Among the richest trajectory datasets available. Authors request submitters keep solutions out of training data. Code: github.com/METR/RE-Bench.
-
PaperBench (OpenAI, ICML 2025) — reproduce 20 ICML 2024 Spotlight/Oral papers from scratch: codebase + experiments + result matching. 8,316 individually graded sub-tasks via author-co-developed rubrics + LLM judge (SimpleJudge). ML PhDs achieve 41.4%; Claude 3.5 Sonnet New reaches 21.0%. Full PaperBench costs roughly $4,000 USD per model on average; PaperBench Code-Dev variant is cheaper and GPU-free (~$2,000), though the paper notes Code-Dev correlates only “weakly” with full PaperBench. Code: github.com/openai/preparedness.
-
MLAgentBench (Stanford / Snap, arXiv 2310.03302, NeurIPS 2024) — 13 ML experimentation tasks: CIFAR-10 improvement, BabyLM, Kaggle challenges. ReAct-style scaffold with file ops, code execution, output inspection. Best agent (Claude 3 Opus) hits 37.5% average success. Foundational predecessor to MLE-bench / MLRC-Bench. Some tasks (CIFAR-10) now saturated. Code: github.com/snap-stanford/MLAgentBench.
-
MLRC-Bench (arXiv 2504.09702, Apr 2025, NeurIPS 2025) — 7 ML research competition tasks from recent ML conferences (LLM merging, machine unlearning, perception, few-shot). Explicitly designed to require novel methodological innovation beyond hyperparameter tuning. Best agent (gemini-exp-1206 under MLAB scaffold) closes only 9.3% of the gap between baseline and top human — currently the strongest measure of how far frontier LLMs are from doing methodological research. Designed as a “dynamic” benchmark that incorporates new competitions. Code: github.com/yunx-z/MLRC-Bench; HF Spaces leaderboard.
-
MLE-Dojo (arXiv 2505.07782, May 2025) — Gym-style interactive environment built on 200+ Kaggle challenges with a dedicated training set of 150 tasks. Explicitly supports SFT and RL — the trainable counterpart to MLE-bench. Sampling trajectories provided for both. Heavy overlap with MLE-bench’s Kaggle distribution, so most teams pick one or the other depending on whether they need trainable RL trajectories (MLE-Dojo) or industry-standard leaderboard comparability (MLE-bench). Code: github.com/MLE-Dojo/MLE-Dojo.
-
MLGym / MLGym-Bench (Meta + NYU, arXiv 2502.14499, Feb 2025) — Gym-style ML research environment with 13 open-ended tasks spanning CV, NLP, RL, game theory. Baselines from Claude 3.5 Sonnet, Llama-3.1-405B, GPT-4o, o1-preview, Gemini-1.5 Pro. CC-BY-NC 4.0 license — limits commercial deployment. Authors note current frontier models mostly tune hyperparameters in practice rather than ideating. Code: github.com/facebookresearch/MLGym.
-
AI-Scientist v1 & v2 (Sakana, v1 arXiv 2408.06292, 2024; v2 arXiv 2504.08066, 2025). v1 generates end-to-end ML papers on nanoGPT, 2D diffusion, grokking templates. v2 removes templates and uses progressive agentic tree search; one v2-generated paper was accepted at the ICBINB workshop at ICLR 2025 (then voluntarily withdrawn per a pre-agreement). Per ICBINB organizers’ ICLR 2026 proposal, past editions accepted roughly 35 of 40 submissions (~87.5%), far above main-conference rates — treat the “first AI-generated peer-reviewed paper” headline accordingly. ~$15–25 per full paper. Caveat: this is an agent system, not a benchmark with standardized scoring; evaluation requires LLM-as-judge or human review. Code: github.com/SakanaAI/AI-Scientist (~13.8k stars) and github.com/SakanaAI/AI-Scientist-v2 (~6.4k stars).
-
DSBench (ICLR 2025, arXiv 2409.07703) — 466 data-analysis tasks + 74 data-modeling tasks from ModelOff and Kaggle. Long contexts, multimodal data, multi-table reasoning. Best agent solves 34.12% of analysis tasks. More “analyst” than “scientist,” but a useful adjacent reference. Code: github.com/LiqiangJing/DSBench.
Scientific Discovery & Reproduction — Deep Dive
科学发现与复现——细读
The “scientist-like” tier — domain-grounded tasks closer to actual research workflows than to ML engineering.
-
TTT-Discover’s open-sourced environments (arXiv 2601.16175, Jan 2026) — the four domains the rest of this post analyzed are themselves fully open-sourced: Erdős minimum-overlap & autocorrelation inequalities (closed-form continuous reward; AlphaEvolve set 0.380924, TTT-Discover improved to 0.380876); GPUMode TriMul kernel competition (H100/A100 timing as reward; TTT-Discover 1161μs vs human 1371μs on H100); AtCoder Heuristic Contests (overlaps ALE-Bench; 1st place in ahc039); single-cell denoising on OpenProblems (training on Pancreas, evaluation held out on PBMC and Tabula Muris Senis Lung; Poisson / MSE score). Code: github.com/test-time-training/discover with sandbox reward evaluators and ALE-Bench / GPUMode integration.
-
DiscoveryBench (AI2, arXiv 2407.01725, Jul 2024) — 264 real data-driven discovery tasks from 6 domains (sociology, biology, humanities, economics, engineering, meta-science) + 903 synthetic tasks. Each task = dataset + metadata + natural-language discovery goal. Hypotheses are constrained by available columns — “open-ended” but bounded by the schema. 100+ citations by mid-2025; foundational AI2 reference. Code: github.com/allenai/discoverybench; HF dataset.
-
ScienceAgentBench (OSU, arXiv 2410.05080, ICLR 2025) — 102 tasks from 44 peer-reviewed publications in bioinformatics, computational chemistry, GIS, cognitive neuroscience. Agent must produce a self-contained Python program. Best agentic framework solves ~32.4% (rising to 34.3% with expert knowledge). Code: github.com/OSU-NLP-Group/ScienceAgentBench; HAL leaderboard at hal.cs.princeton.edu.
-
BixBench (FutureHouse, arXiv 2503.00096, Mar 2025) — 53 real-world bioinformatics scenarios with 296 guiding research questions, paired with Jupyter notebooks and heterogeneous input data. Agents execute Python/R/Bash in notebooks; open-ended analytical workflows. Adopted by Biomni Lab, Edison Analysis, and multiple agent providers as of 2026. Code: github.com/Future-House/BixBench; HF dataset.
-
LAB-Bench (FutureHouse, 2024) — 2,400+ multiple-choice questions on practical biology research capabilities. MCQ format limits its use as an open-ended research-agent eval, but a reference for “does the model know the standard bio workflows.”
-
CORE-Bench (Princeton, arXiv 2409.11363, TMLR 2025) — 270 tasks from 90 papers across CS / social science / medicine focused on computational reproducibility (rerun existing code/data). 3 difficulty levels; best agent: 21% on hardest level. Narrower than PaperBench but more grounded.
-
SUPER (Allen AI, arXiv 2409.07440, EMNLP 2024) — 45 expert end-to-end + 152 sub-problems + 602 auto-generated problems on setting up and executing research repos. GPT-4o solves 16.3% e2e / 46.1% sub-problems. Gold solutions and baseline trajectories at github.com/allenai/super-benchmark.
-
RExBench (Tin Lab, arXiv 2506.22598, Jun 2025) — 12 realistic research-extension tasks (NLP/ML), each built on existing papers with domain-expert-written extension instructions. Best agent: ~33% without hints, ~44% with hints. Extensions are given, not discovered. Code: github.com/tinlaboratory/RExBench (gated HF dataset).
Code, Kernels, and Software Engineering — Deep Dive
代码、Kernel 与软件工程——细读
-
KernelBench (Stanford, arXiv 2502.10517, Dec 2024, ICML 2025) — 250 PyTorch ML workload tasks across 4 difficulty levels (single ops, fused ops, full models, HuggingFace model optimization). Metric:
fast_p= correct AND >p× faster than PyTorch. Native single-turn but explicitly supports iterative refinement with execution + profiler feedback; DeepSeek R1 with 10 turns reaches 43% / 72% / 18% across Levels 1/2/3. New kernelbench.com v3 (multi-GPU) and “Hard” (Blackwell) variants released Q1/Q2 2026; KernelBench-X extension exists; follow-ups include Meta’s KernelEvolve, AMD’s GEAK Triton agents, Sakana’s reproductions. Best-paper at DL4C @ ICLR 2025; ICML 2025. The field’s central GPU-kernel-LLM benchmark. Code: github.com/ScalingIntelligence/KernelBench. -
ALE-Bench (Sakana, arXiv 2506.09050, Jun 2025) — past AtCoder Heuristic Contest tasks: score-based NP-hard optimization (routing, scheduling, packing). The only major benchmark explicitly designed for objective-driven, score-based, long-horizon algorithmic engineering (contests run for one to two weeks). Supports interactive agents with test-run feedback and visualizations. ALE-Agent ships as reference baseline. Backbone for TTT-Discover’s algorithm domain. Code: github.com/SakanaAI/ALE-Bench; HF dataset at hf.co/datasets/SakanaAI/ALE-Bench.
-
SWE-bench Verified / Pro / EVO — the repository-bug-fix family. SWE-bench Verified is 500 human-filtered SWE-bench instances; mostly closed-ended (hidden unit tests) with 1–2 line fixes for 161 of 500 problems. SWE-Bench Pro (Sep 2025) is 1,865 problems from 41 enterprise-style repos with long-horizon, multi-file changes (avg 21 files modified). SWE-EVO (Dec 2025) provides 48 release-note-based tasks with avg 874 tests per instance. All three measure software engineering capability but a single canonical patch is hidden behind tests — doesn’t satisfy “open-ended ideation.” Useful adjacent baselines.
-
SWE-Gym (Berkeley, Dec 2024, ICML 2025) — 2,438 real-world Python tasks with executable runtime, unit tests, NL specs. Designed for training, not just evaluating. The entire fine-tuned model + trajectory dataset are open. 32.0% on SWE-Bench Verified with a 32B model. R2E-Gym extension (arXiv 2504.07164) provides 8.1K procedurally-generated environments for even more trajectory data — strong for trajectory pretraining, weak on ideation.
-
GSO (arXiv 2505.23671, May 2025, NeurIPS 2025 D&B) — 102 software optimization tasks mined from real commits in NumPy, Pandas, Pillow, llama.cpp, etc. across 5 languages. Agent gets codebase + perf test, must beat expert dev optimization. “Hack Detector” anti-gaming. Frontier SWE-agents get <5% OPT@1. Agent trajectories explicitly released at gso-bench.github.io. Closest cousin of KernelBench in spirit.
-
Commit0 (Cornell + Cohere, arXiv 2412.01769, Dec 2024) — 54 entire Python libraries built from scratch from API specs + interactive unit tests. Best agent: 29.30% subset / 6.12% full.
-
Aider Polyglot (Dec 2024) — 225 hard Exercism problems in 6 languages. Light multi-step: two attempts with unit-test feedback. Closed-form puzzles; useful for “does the model code?” but not for ideation-bound evaluation.
-
SWE-Lancer (OpenAI, 2025) — freelance software-engineering tasks scored by code review / payout. Less mature trajectory ecosystem than the SWE-bench family.
Math, Theorem Proving, and Formal Reasoning — Deep Dive
数学、定理证明、形式推理——细读
-
FrontierMath (Epoch AI, 2024) — per Epoch AI’s benchmark page: “The full FrontierMath dataset contains 350 problems. This is split into a base set of 300 problems, which we call Tiers 1–3, and an expansion set of 50 exceptionally difficult problems, which we call Tier 4.” Single-answer (numerical / symbolic match), sub-second eval. Excellent capability signal but doesn’t reward search.
-
PutnamBench (arXiv 2407.11214) — 640 Putnam problems formalized in Lean 4, Isabelle, and Coq (417 in Coq). A proof attempt is multi-step in the prover but eval is binary pass/fail with fast verification.
-
MiniF2F — 488 olympiad-level problems formalized across Lean / Isabelle / HOL Light / Metamath. The longest-standing formal math benchmark.
-
LeanDojo / Lean Workbook — end-to-end Lean 4 environment with retrieval; Lean Workbook has ~57K formal-informal problem pairs. Better positioned than PutnamBench for open-ended search because the environment is multi-step at tactic level, though eval is still cheap.
-
Erdős-style open math problems (via TTT-Discover) — the autocorrelation inequality and minimum-overlap problems used by TTT-Discover are the only “open-ended math” tasks that genuinely satisfy continuous reward + ideation matters + multi-step search + recognized math identity. As a benchmark suite they exist only as TTT-Discover’s sandbox.
Adjacent Ecosystems: Paper Writing, Web Agents, Games, Evolutionary Reproductions
相邻生态:论文写作、Web Agent、游戏、演化搜索复现
Paper writing. ResearchBench (Shanghai AI Lab, Mar 2025) covers inspiration retrieval + hypothesis composition + hypothesis ranking over 1,386 papers in 12 disciplines — but it’s a static text benchmark, no multi-step environment. ReportBench (ByteDance) is survey-writing eval via citation overlap + statement validation — same disqualifier. The community-run ML Reproducibility Challenge is annual but with informal eval. AI-Scientist v2 generates papers but evaluation falls back to LLM-as-judge or workshop submission.
Web agents. WebArena, VisualWebArena, WorkArena, Mind2Web provide multi-step browser environments with binary task-completion rewards — useful adjacent baselines if web research is in scope. BrowseComp (OpenAI, Apr 2025) is 1,266 hard short-answer web research questions — single-answer format, sparse reward; OpenAI Deep Research already saturates the easy portion. BrowseComp-Plus (ACL 2026) adds a fixed corpus with human-verified supporting docs; better-controlled but still single-answer.
Game-based RL environments. NetHack via NLE, Crafter, MineDojo, Voyager; BALROG (ICLR 2025) aggregates several. Craftax (JAX, multi-orders-of-magnitude speedup) supports fast iteration. These satisfy “open-ended” and “long-horizon” but the broader ML community frames them as embodied / long-horizon planning testbeds, not “AI scientist” benchmarks. Including them in a research-agent portfolio dilutes the AI-for-science narrative.
Robotics curricula. RoboHive, RLBench, LIBERO are well-developed for robot learning but not “AI scientist” benchmarks.
Evolutionary-search reproductions. AlphaEvolve is closed source. OpenEvolve and CodeEvolve reproduce its template — use these if you want AlphaEvolve-style environments. AlphaEvolve’s reported numbers — per Google Cloud’s “AlphaEvolve on Google Cloud” blog, “AlphaEvolve sped up a vital kernel in Gemini’s architecture by 23%, leading to a 1% reduction in Gemini’s training time” — are DeepMind-reported and not independently verified. FunSearch domains (cap-set, online bin packing on OR-Library OR1–OR4 + Weibull 5K, cyclic graphs) are public at github.com/google-deepmind/funsearch but the repository ships code, not a bespoke benchmark suite.
Optimization-flavored. OPT-BENCH (arXiv 2506.10764, Jun 2025) is 20 Kaggle ML tasks + 10 NP-hard combinatorial problems with multi-step interaction — moderate fit. OptiBench is single-shot LP/MILP modeling.
Hardware / systems optimization beyond kernels. No widely-accepted benchmark; KernelBench and GSO are the closest the community currently has.
Materials, chemistry, drug discovery. No public benchmark currently matches all four criteria. ChemBench exists as a knowledge / MCQ eval. The most credible attempts at agentic discovery in these areas remain proprietary (Anthropic’s microbiology results, Isomorphic Labs).
TTT-Discover's Four Domains, Mapped
TTT-Discover 四个领域对照公开 Benchmark
To reproduce TTT-Discover or compare against it, each of its four domains has a corresponding community-recognized benchmark:
| TTT-Discover domain | Public benchmark | Notes |
|---|---|---|
| Erdős minimum overlap / autocorrelation | TTT-Discover’s open sandbox at github.com/test-time-training/discover; AlphaEvolve-style reproductions in OpenEvolve and CodeEvolve | No standalone community benchmark; the problems themselves are recognized math |
| GPUMode TriMul kernel | KernelBench (250 PyTorch tasks across 4 difficulty levels, fast_p metric) | TTT-Discover used GPUMode directly; KernelBench is the community-standard generalization |
| AtCoder Heuristic Contests | ALE-Bench (past AHC tasks, score-based continuous reward, ALE-Agent reference baseline) | TTT-Discover used AHC directly; ALE-Bench is the now-standard wrapper |
| Single-cell RNA denoising | OpenProblems (single-cell benchmarking framework); BixBench (53 bioinformatics scenarios at notebook-level) | OpenProblems is the underlying dataset; BixBench is the agent-task wrapper |
The pattern is consistent: TTT-Discover picked the underlying problem source for each domain (GPUMode, AtCoder, OpenProblems) but did not adopt the benchmark wrappers the community has since standardized on (KernelBench, ALE-Bench, BixBench). A successor system can plug into the wrappers directly.
Two adjacent ecosystems share the same open-ended-plus-multi-step-plus-executable shape but extend in different directions:
- Data-driven scientific discovery. DiscoveryBench (AI2; 264 real data-driven discovery tasks across sociology, biology, humanities, economics, engineering, meta-science), ScienceAgentBench (OSU; 102 tasks from 44 peer-reviewed publications in bioinformatics, comp. chemistry, GIS, cognitive neuroscience), and BixBench push toward analytical workflows rather than just engineering.
- Software optimization. GSO (102 software optimization tasks mined from real commits in NumPy, Pandas, Pillow, llama.cpp across 5 languages) is the closest cousin of KernelBench in spirit. The SWE-bench / SWE-Gym / R2E-Gym family handles repository-level software engineering, though most instances of SWE-bench Verified are closed-ended bug fixes.
What's Still Missing
仍然缺什么
Three notable gaps as of mid-2026:
- No materials / chemistry / drug-discovery benchmark satisfies all four criteria. ChemBench is multiple-choice. The most serious “agentic discovery” efforts in these areas are either proprietary (Anthropic’s microbiology work, Isomorphic Labs) or require wet-lab experiments — currently out of reach for pure-software agents.
- No paper-writing benchmark with closed-loop scoring. ResearchBench (1,386 papers across 12 disciplines) and ReportBench measure surface properties — citation overlap, hypothesis ranking, statement validation — but don’t reward end-to-end execution. AI-Scientist v2 generates papers but evaluation falls back to LLM-as-judge or workshop submission, both noisy and inflated by workshop acceptance rates.
- Game-based environments — NetHack, Crafter, MineDojo, Voyager, BALROG — are open-ended and long-horizon but the community does not treat them as “AI scientist” benchmarks. They sit cleanly on the “embodied / long-horizon planning testbed” branch instead.
The takeaway: TTT-Discover’s four-task hand-pick was effective at proving the method, but it’s one slice of a much wider landscape. The same RL-at-inference loop should plug, with minimal changes, into MLE-bench, RE-Bench, KernelBench, ALE-Bench, MLRC-Bench, and PaperBench Code-Dev. The next paper in this line will likely report on exactly that — and the comparison against frontier models will move with it.