Evolution Fine‑Tuning
Learning to Discover Across 371 Optimization Tasks

A mid-training “practice phase” that teaches small open‑source LLMs how to evolve solutions.

¹University of Minnesota · ²Carnegie Mellon University · ³KAIST · ⁴University of Cambridge · ⁵Hanyang University · ⁶Amazon

*This work is independent of the author’s position at Amazon and does not relate to any work conducted at Amazon.

TL;DR — Evolution Fine-Tuning (EFT) converts evolutionary search trajectories into supervision, giving small open-source LLMs a practice phase that teaches them how to evolve solutions before they ever see a new problem. Trained on the 156K-trajectory Finch Collection, our Finch models generalize discovery skill across 22 held-out tasks (+10.22% over base), compose strategies across domains, and reach state-of-the-art on circle-packing when paired with test-time RL.
[Figure: EFT teaser, mid-training and cross-discovery transfer]
EFT acts as mid-training. Finch lifts discovery on the Erdős minimum-overlap problem under both test-time search and learning (left); on NP-hard competitive programming it composes strategies across domains, while the base model repeats a single one (right).
156K filtered trajectories
371 tasks · 10 domains
+10.2% avg. gain on 22 held-out tasks
2–9B Finch model sizes
Abstract

Would designing faster GPU kernels help close in on an open math conjecture?

LLMs integrated into evolutionary search have recently produced state-of-the-art solutions on optimization tasks, including open mathematical conjectures, GPU kernel design, scientific-law discovery, and combinatorial puzzles. To achieve this, prior work applies a search scaffold to one target task at a time, so every new problem is approached from scratch and the experience accumulated during search is discarded once the run finishes. This leaves the capability of iteratively evolving a solution (knowing which part to mutate and how, deciding when to backtrack) entirely in the scaffold rather than in the model itself.

We introduce Evolution Fine-Tuning (EFT), a mid-training paradigm that teaches LLMs to evolve solutions across tasks by converting evolutionary search trajectories into supervision. We construct the Finch Collection, a 156K-trajectory dataset spanning 10 domains and 371 optimization tasks, and fine-tune open-source LLMs from 2B to 9B parameters. Empirically, EFT confers cross-task generalization: across 22 held-out tasks, our models surpass their base counterparts by 10.22% on average. Furthermore, when paired with test-time RL, our model matches state-of-the-art performance on two circle-packing tasks and outperforms its base-model counterpart on the Erdős minimum-overlap problem. EFT thus serves as a “practice phase” for general-purpose discovery agents, so that they no longer have to solve each new problem from scratch.

The idea

Move discovery skill out of the scaffold and into the model

Test-time search needs an expensive proprietary mutation operator; test-time learning over-fits a single task and throws the strategy away. EFT distills the discovery behavior itself into a small model — which then plugs into either scaffold.

[Figure: Concept of Evolution Fine-Tuning]
The EFT idea in one picture. Rather than relying on expensive prompting (test-time search) or single-task RL (test-time learning), EFT distills discovery behavior into the model. The result is Finch, which then works inside either scaffold, with frozen weights or with further adaptation.

A mid-training “practice phase”

Instead of rebuilding discovery skill from scratch inside every search run, EFT teaches the LLM how to mutate, what to keep, and when to backtrack — so it practices before deployment rather than solving each new problem from zero.

Trajectories as supervision

Many of these optimization tasks are NP-hard and have no known ground-truth optimum, so (problem, answer) pairs are unavailable. EFT instead treats the trajectories of search runs, i.e. parent → child transitions with scores, as the training signal.
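A minimal sketch of what one such training record might look like. The `Transition` schema, its field names, the example task name, and the prompt template are all illustrative assumptions, not the Finch Collection's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One parent -> child step harvested from a search run (hypothetical schema)."""
    task_id: str
    parent_program: str
    parent_score: float
    child_program: str
    child_score: float
    higher_is_better: bool = True

    def score_delta(self) -> float:
        """Signed improvement: positive means the child beats the parent."""
        raw = self.child_score - self.parent_score
        return raw if self.higher_is_better else -raw


def to_training_example(t: Transition) -> dict:
    """Turn a transition into a (prompt, target) pair for supervised fine-tuning."""
    prompt = (
        f"Task: {t.task_id}\n"
        f"Current score: {t.parent_score}\n"
        f"Current program:\n{t.parent_program}\n"
        "Propose an improved program."
    )
    return {"prompt": prompt, "target": t.child_program}


# Illustrative values only; the task name and programs are made up.
example = Transition(
    task_id="circle_packing_n26",
    parent_program="def pack(): ...",
    parent_score=1.25,
    child_program="def pack(): ...  # tuned radii",
    child_score=1.53,
)
record = to_training_example(example)
```

The key design point is that the supervision target is the child program conditioned on the parent and its score, so no ground-truth answer is ever required.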

Orthogonal to the scaffold

An EFT model can serve as a frozen mutation operator inside test-time search, or be further adapted by test-time RL. EFT is a layer beneath both branches, not a replacement for either.
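The "frozen mutation operator" role can be sketched as a toy evolutionary loop in which the operator is just a callable. Everything here (`evolve`, the truncation selection, the numeric toy task) is a hypothetical stand-in and not OpenEvolve's algorithm; in practice `mutate` would wrap a call to the frozen model.

```python
import random
from typing import Callable, List, Tuple

Candidate = Tuple[float, List[float]]  # (score, solution)


def evolve(
    evaluate: Callable[[List[float]], float],
    mutate: Callable[[List[float], random.Random], List[float]],
    seed_solution: List[float],
    iterations: int = 200,
    population_size: int = 8,
    rng_seed: int = 0,
) -> Tuple[Candidate, List[Tuple[float, float]]]:
    """Run a simple evolutionary search and log (parent_score, child_score) pairs."""
    rng = random.Random(rng_seed)
    population: List[Candidate] = [(evaluate(seed_solution), seed_solution)]
    trajectory: List[Tuple[float, float]] = []
    for _ in range(iterations):
        parent_score, parent = rng.choice(population)
        child = mutate(parent, rng)          # the pluggable mutation operator
        child_score = evaluate(child)
        trajectory.append((parent_score, child_score))
        population.append((child_score, child))
        # Keep only the best candidates (higher score is better).
        population.sort(key=lambda c: c[0], reverse=True)
        del population[population_size:]
    return population[0], trajectory


def toy_evaluate(x: List[float]) -> float:
    """Toy objective: maximized when every coordinate equals 3."""
    return -sum((v - 3.0) ** 2 for v in x)


def toy_mutate(x: List[float], rng: random.Random) -> List[float]:
    """Stand-in for an LLM mutation: Gaussian jitter on each coordinate."""
    return [v + rng.gauss(0.0, 0.5) for v in x]


best, trajectory = evolve(toy_evaluate, toy_mutate, [0.0, 0.0])
```

Because the loop only sees `mutate` as a function, swapping a base model for an EFT model changes nothing else in the scaffold, which is the sense in which EFT sits beneath it.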

Emergent cross-domain transfer

Because it is trained across many domains at once, Finch composes strategies it learned elsewhere when tackling a new problem — behavior the base model never exhibits.

The Finch Collection

156K evolutionary trajectories, 371 tasks, 10 domains

Optimization training data is hard to synthesize, so we source 371 seed tasks from 10 existing benchmarks — each requiring nontrivial search, with a deterministic continuous-score evaluator — and harvest the search itself.

[Figure: Finch Collection construction pipeline]
The construction pipeline. Collect seed optimization tasks → run an evolutionary scaffold (OpenEvolve) with a strong teacher mutation operator to harvest parent-to-child trajectories → filter out broken, systematic-error, and overlong cases, yielding ~156K trajectories over 371 tasks.
1. Seed task collection

371 tasks sourced from 10 benchmarks, chosen to require real search rather than ground-truth matching, with deterministic scorers.

2. Trajectory collection

OpenEvolve with a Qwen3.5-397B-A17B teacher runs each task under diff-edit and full-rewrite strategies → 172,997 raw trajectories.

3. Filtering & labeling

Remove systematic errors, hard-negative breakages, and overlong inputs (90.6% retained), then label each by score delta.
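The filtering and labeling step could look roughly like the sketch below. The record fields, the `MAX_INPUT_CHARS` cap, and the three label names are assumptions made for illustration; the source does not specify how systematic errors are detected, so that check is left out here, and scores are assumed to be oriented so that higher is better.

```python
from typing import Dict, List

MAX_INPUT_CHARS = 20_000  # hypothetical cap for "overlong" inputs


def label_by_delta(parent_score: float, child_score: float, eps: float = 1e-9) -> str:
    """Label a transition by the sign of its score delta."""
    delta = child_score - parent_score
    if delta > eps:
        return "improved"
    if delta < -eps:
        return "regressed"
    return "unchanged"


def filter_and_label(raw: List[Dict]) -> List[Dict]:
    """Drop broken or overlong trajectories, then attach a score-delta label."""
    kept = []
    for t in raw:
        if t.get("child_score") is None:       # child failed to run or score
            continue
        if len(t["prompt"]) > MAX_INPUT_CHARS:  # too long to train on
            continue
        label = label_by_delta(t["parent_score"], t["child_score"])
        kept.append({**t, "label": label})
    return kept


raw_trajectories = [
    {"prompt": "p1", "parent_score": 1.0, "child_score": 1.5},
    {"prompt": "p2", "parent_score": 1.0, "child_score": 1.0},
    {"prompt": "p3", "parent_score": 1.0, "child_score": 0.2},
    {"prompt": "p4", "parent_score": 1.0, "child_score": None},
    {"prompt": "x" * (MAX_INPUT_CHARS + 1), "parent_score": 1.0, "child_score": 2.0},
]
kept = filter_and_label(raw_trajectories)
labels = [t["label"] for t in kept]
```

As a sanity check on the reported numbers, 172,997 × 0.906 ≈ 156.7K, in line with the ~156K figure.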

[Figure: Task distribution across domains]
371 tasks across 10 domains. Bubble size shows each group’s task count, led by competitive programming and numerical algorithm optimization.

Tasks per domain:
Competitive Programming: 172
Numerical Algorithm Opt.: 47
SR · Physics Oscillation: 44
Heuristic Optimization: 35
Mathematical Discovery: 28
SR · Bio Pop. Growth: 24
SR · Chem Reaction: 12
GPU Kernel Optimization: 4
scRNA-seq Denoising: 3
Constructive Search: 2

[Figure: Improvement distribution of trajectories]
Improvement breakdown. 39.4% of trajectories improve the parent, 19.2% leave it unchanged, 41.3% regress, supplying both imitation and preference (good-vs-bad) signal.

The collection is balanced across languages (68.5% Python / 31.5% C++) and strategies (50.3% diff-edit / 49.7% full-rewrite). We fine-tune the Qwen3.5 (2B/4B/9B) and Qwen3-8B bases via full SFT on improved trajectories from 355 tasks (16 held out), producing the Finch family.
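A rough sketch of how the SFT set might be assembled: split at the task level so held-out tasks never leak into training, then keep only improved trajectories from the training tasks. The random split, the `label` field, and the synthetic task ids are assumptions; the source does not say how the 16 held-out tasks were chosen.

```python
import random
from typing import Dict, Iterable, List, Set, Tuple


def split_tasks(task_ids: Iterable[str], n_held_out: int, seed: int = 0) -> Tuple[Set[str], Set[str]]:
    """Split task ids into (train, held_out) sets; random choice is an assumption."""
    shuffled = sorted(task_ids)            # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)
    held_out = set(shuffled[:n_held_out])
    train = set(shuffled[n_held_out:])
    return train, held_out


def select_sft_examples(trajectories: List[Dict], train_tasks: Set[str]) -> List[Dict]:
    """Keep improved trajectories that belong to training tasks only."""
    return [
        t for t in trajectories
        if t["task_id"] in train_tasks and t["label"] == "improved"
    ]


# Synthetic ids standing in for the 371 tasks.
all_tasks = [f"task_{i:03d}" for i in range(371)]
train_tasks, held_out_tasks = split_tasks(all_tasks, n_held_out=16)

some_train_task = sorted(train_tasks)[0]
some_held_out_task = sorted(held_out_tasks)[0]
trajectories = [
    {"task_id": some_train_task, "label": "improved"},
    {"task_id": some_train_task, "label": "regressed"},
    {"task_id": some_held_out_task, "label": "improved"},
]
sft_examples = select_sft_examples(trajectories, train_tasks)
```

Splitting by task rather than by trajectory matters here: two trajectories from the same task share a problem statement, so a trajectory-level split would overstate generalization.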

Results

EFT confers cross-task discovery generalization

Used as a mutation operator inside test-time search, Finch beats its base models across 22 held-out tasks — and lets small models rival non-EFT models twice their size.

+10.22%

Average held-out gain

Finch over same-size base models across 22 held-out tasks, with up to +290% on ahc058 and +74% on Transaction.

2× size

Punching above its weight

Finch-4B reaches 0.3865 on Erdős, better than the 0.4036 of Qwen3-8B (lower is better), a base model twice its size.

+14.1%

Positive task-scaling

Held-out performance rises steadily as the Finch Collection grows from 15 → 355 training tasks.

Held-out · OpenEvolve

Main results: test-time search

Finch vs. same-size base · 11 held-out metrics
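Column groups: Mathematical Discovery (Erdős, AC1, AC2, CP(26), Hadamard) · Algorithm Eng. (ahc039, ahc058) · System Performance (EPLB, PRISM, LLM-SQL, Transaction). A dash marks a value that is not reported.

| Model | Erdős↓ | AC1↓ | AC2↑ | CP(26)↑ | Hadamard↑ | ahc039↑ | ahc058↑ | EPLB↑ | PRISM↑ | LLM-SQL↑ | Transaction↑ | Avg. Δ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Best Human | 0.380927 | 1.5097 | 0.9015 | 2.634000 | 0.935673 | 566,997 | 847,674,723 | 0.1265 | 21.89 | 0.6920 | 2724.80 | – |
| Initial Program | 0.495056 | 1.5186 | 0.8558 | 0.959764 | 0.143275 | 534,850 | 0 | 0.1265 | 21.89 | 0.6856 | 2824.86 | – |
| OpenEvolve + Proprietary Models | | | | | | | | | | | | |
| Claude-Opus-4.6 | 0.381880 | – | – | 2.629300 | – | – | – | 0.1270 | 26.26 | 0.7160 | 3774.00 | – |
| Gemini-3-Pro | – | – | – | 2.541400 | – | – | – | 0.1272 | 26.24 | 0.7258 | 4273.50 | – |
| GPT-5 | – | – | – | 2.541400 | – | – | – | 0.1272 | 26.23 | 0.7155 | 4237.30 | – |
| OpenEvolve + Open-source Models | | | | | | | | | | | | |
| Qwen3.5-2B | 0.381737 | 1.5186 | 0.8646 | 1.253056 | 0.478009 | 546,078 | 0 | 0.1265 | 21.89 | 0.6856 | 2832.86 | – |
| Finch-2B | 0.381346 | 1.5184 | 0.8920 | 1.535134 | 0.400476 | 545,256 | 329,359,253 | 0.1269 | 22.26 | 0.6860 | 2949.85 | – |
| Δ | +0.10% | +0.01% | +3.17% | +22.51% | -16.22% | -0.15% | – | +0.32% | +1.69% | +0.06% | +4.13% | +1.56% |
| Qwen3.5-4B | 0.416924 | 1.5186 | 0.8802 | 1.680787 | 0.384332 | 542,077 | 0 | 0.1266 | 21.89 | 0.6856 | 2732.24 | – |
| Finch-4B | 0.386460 | 1.5173 | 0.8933 | 1.806808 | 0.146199 | 551,844 | 331,466,883 | 0.1267 | 22.87 | 0.6857 | 4761.90 | – |
| Δ | +7.31% | +0.09% | +1.49% | +7.50% | -61.96% | +1.80% | – | +0.08% | +4.48% | +0.01% | +74.30% | +3.40% |
| Qwen3-8B | 0.403585 | 1.5177 | 0.8980 | 1.797576 | 0.452330 | 557,081 | 0 | 0.1269 | 23.81 | 0.6858 | 3174.60 | – |
| Finch-8B | 0.381236 | 1.5154 | 0.9001 | 1.822617 | 0.501743 | 557,168 | 135,184,684 | 0.1270 | 24.70 | 0.7341 | 3257.33 | – |
| Δ | +5.54% | +0.15% | +0.23% | +1.39% | +10.92% | +0.02% | – | +0.08% | +3.74% | +7.04% | +2.61% | +3.17% |
| Qwen3.5-9B | 0.385512 | 1.5186 | 0.8801 | 1.172702 | 0.397184 | 553,582 | 134,486,700 | 0.1269 | 22.36 | 0.6858 | 3584.23 | – |
| Finch-9B | 0.381100 | 1.5141 | 0.9122 | 1.936000 | 0.480585 | 553,759 | 525,286,896 | 0.1265 | 23.93 | 0.7024 | 3636.36 | – |
| Δ | +1.14% | +0.30% | +3.65% | +65.09% | +21.00% | +0.03% | +290.59% | -0.32% | +7.02% | +2.42% | +1.45% | +10.24% |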

Δ is the relative improvement of Finch over its same-size base, sign-adjusted so positive always means better. Avg. Δ averages the available metrics (ahc058 excluded — its near-zero base inflates the ratio). Finch lifts the average at every scale, up to +10.24% at 9B, and matches strong proprietary operators (Claude-Opus-4.6, Gemini-3-Pro, GPT-5) on several metrics with a far smaller open backbone.
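The Δ definition can be made concrete with a short sketch. The function names and the dictionary layout are illustrative assumptions; the numbers are copied from the Qwen3.5-2B and Finch-2B rows of the main results table, and a zero base is treated as "ratio undefined", which is one plausible reading of why ahc058 is left out.

```python
from typing import Dict, Iterable, Optional, Tuple


def relative_delta(base: float, finch: float, higher_is_better: bool) -> Optional[float]:
    """Sign-adjusted relative improvement in percent; None when the base is zero."""
    if base == 0:
        return None  # ratio undefined
    change = (finch - base) / abs(base)
    return 100.0 * (change if higher_is_better else -change)


def average_delta(deltas: Dict[str, Optional[float]], exclude: Iterable[str] = ()) -> float:
    """Mean over the available metrics, skipping undefined and excluded ones."""
    skipped = set(exclude)
    values = [d for name, d in deltas.items() if d is not None and name not in skipped]
    return sum(values) / len(values)


# metric -> (base score, Finch score, higher_is_better), 2B rows of the table
ROW_2B: Dict[str, Tuple[float, float, bool]] = {
    "Erdős": (0.381737, 0.381346, False),
    "AC1": (1.5186, 1.5184, False),
    "AC2": (0.8646, 0.8920, True),
    "CP(26)": (1.253056, 1.535134, True),
    "Hadamard": (0.478009, 0.400476, True),
    "ahc039": (546_078, 545_256, True),
    "ahc058": (0, 329_359_253, True),
    "EPLB": (0.1265, 0.1269, True),
    "PRISM": (21.89, 22.26, True),
    "LLM-SQL": (0.6856, 0.6860, True),
    "Transaction": (2832.86, 2949.85, True),
}

deltas_2b = {name: relative_delta(*vals) for name, vals in ROW_2B.items()}
avg_2b = average_delta(deltas_2b, exclude={"ahc058"})

for name, delta in deltas_2b.items():
    print(name, "n/a" if delta is None else f"{delta:+.2f}%")
print(f"Avg. Δ: {avg_2b:+.2f}%")
```

Under these assumptions the 2B row reproduces the reported per-metric Δ values and the +1.56% average.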

FrontierCS

NP-hard competitive programming

6 UC Berkeley contest problems
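| Model | P263↑ | P301↑ | P302↑ | P303↑ | P304↑ | P305↑ | Avg.↑ |
|---|---|---|---|---|---|---|---|
| Qwen3.5-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Finch-2B | 0.38 | 1.63 | 0.12 | 3.16 | 0.00 | 31.43 | 6.12 |
| Qwen3.5-4B | 8.15 | 20.99 | 27.07 | 0.00 | 0.00 | 30.89 | 14.52 |
| Finch-4B | 27.44 | 68.17 | 24.41 | 31.79 | 0.00 | 40.03 | 31.97 |
| Qwen3-8B | 23.72 | 44.67 | 36.78 | 10.84 | 0.00 | 29.23 | 24.21 |
| Finch-8B | 38.12 | 39.41 | 36.68 | 10.84 | 0.00 | 22.34 | 24.56 |
| Qwen3.5-9B | 55.09 | 27.59 | 35.63 | 35.54 | 5.81 | 35.14 | 32.46 |
| Finch-9B | 86.10 | 58.78 | 36.68 | 34.02 | 22.11 | 38.38 | 46.01 |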

Average score across six NP-hard FrontierCS contest problems. Finch beats its base at every size, reaching 46.01 at 9B (vs. 32.46) — and lets Finch-4B (31.97) outscore the 2×-larger Qwen3-8B base (24.21).

KTO

Offline RL

preference learning on improved + regressed
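| Model | Erdős↓ | AC1↓ | AC2↑ | CP (FrontierCS Avg.)↑ |
|---|---|---|---|---|
| Best Human | 0.380927 | 1.5097 | 0.9015 | – |
| Qwen3.5-4B | 0.416924 | 1.5186 | 0.8802 | 14.52 |
| Finch-4B | 0.386460 | 1.5173 | 0.8933 | 31.97 |
| Finch-4B + KTO | 0.381809 | 1.5151 | 0.9121 | 36.30 |
| Qwen3-8B | 0.403585 | 1.5177 | 0.8980 | 24.21 |
| Finch-8B | 0.381236 | 1.5154 | 0.9001 | 24.56 |
| Finch-8B + KTO | 0.381596 | 1.5089 | 0.9146 | 37.30 |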

Further training Finch on improved + regressed trajectories with KTO teaches it to tell good solutions from bad. Finch-8B + KTO surpasses the best human score on AC1 (1.5089) and AC2 (0.9146), and KTO lifts competitive-programming scores to 36–37.
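KTO learns from unpaired examples tagged as desirable or undesirable, so the improved and regressed trajectories map onto it directly. Below is a minimal sketch of that conversion; the input fields and the `prompt` / `completion` / boolean `label` layout are assumptions modeled on the unpaired-preference format commonly used by KTO trainers, not the authors' pipeline.

```python
from typing import Dict, List


def to_kto_records(trajectories: List[Dict]) -> List[Dict]:
    """Map labeled transitions to unpaired preference records.

    improved  -> label True  (desirable completion)
    regressed -> label False (undesirable completion)
    unchanged -> dropped, since it carries no preference signal
    """
    records = []
    for t in trajectories:
        if t["label"] not in ("improved", "regressed"):
            continue
        records.append(
            {
                "prompt": t["prompt"],
                "completion": t["child_program"],
                "label": t["label"] == "improved",
            }
        )
    return records


# Illustrative inputs; prompts and programs are placeholders.
labeled = [
    {"prompt": "improve A", "child_program": "A_v2", "label": "improved"},
    {"prompt": "improve B", "child_program": "B_v2", "label": "regressed"},
    {"prompt": "improve C", "child_program": "C_v2", "label": "unchanged"},
]
kto_records = to_kto_records(labeled)
```

Unlike pairwise methods, this layout does not need a good and a bad child for the same parent, which suits trajectories harvested from independent search steps.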

nanodiscover

Online RL (test-time RL)

circle-packing + Erdős
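| Scaffold | Model | Erdős↓ | CP (n=26)↑ | CP (n=32)↑ |
|---|---|---|---|---|
| ThetaEvolve | R1-Qwen3-8B | – | 2.635983 | – |
| TTT-Discover | GPT-OSS-120B | 0.380876 | – | – |
| | Qwen3-8B | 0.380932 | 2.635983 | 2.939572 |
| nanodiscover | Qwen3-8B | 0.380956 | 2.635983 | 2.939573 |
| | Finch-8B | 0.380948 | 2.635983 | 2.939573 |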

As the policy inside test-time RL (nanodiscover), Finch-8B matches state-of-the-art on both circle-packing tasks (n=26 & n=32) and edges out the Qwen3-8B base on the Erdős problem — EFT serves as mid-training that strengthens what online RL reaches.

Ablation

Scaling effect

15 → 355 training tasks
[Figure: Scaling with number of training tasks]
Positive task-scaling. As the Finch Collection grows from 15 to 355 training tasks, held-out performance rises monotonically on AC2, CP, and PRISM — an average +14.1% improvement, evidence that EFT gains come from task diversity rather than any single task.
Acknowledgement

This research was supported by the “Advanced GPU Utilization Support Program” funded by the Government of the Republic of Korea (Ministry of Science and ICT).

We are grateful to the SkyDiscover team for their valuable feedback on the dataset construction process, the use of the SkyDiscover framework, and the overall direction of this research — in particular, Shu Liu, Shubham Agarwal, and Mert Cemri for their insightful comments and discussions. We also thank the OpenEvolve team, especially Ritik Vijayvergiya and Asankhaya, for their guidance on using the OpenEvolve framework and for their thoughtful comments on this work.

We thank the authors of ALE-Bench, especially Yuki Imajuku, and the AtCoder team for authorizing the public release of the evolutionary search trajectories derived from their CC BY-ND 4.0 licensed dataset.

We further thank Byung‑Kwan Lee for valuable feedback during the early stages of this project.

Citation

BibTeX

If you find Evolution Fine-Tuning or the Finch Collection useful, please cite our work. (The arXiv ID will be added once the preprint is posted.)

@misc{lee2026evolution,
  title        = {Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks},
  author       = {Lee, Young-Jun and Kim, Seungone and Kang, Minki and Cheong, Alistair
                  and Chen, Zerui and Han, Seungho and Jung, Taehee and Kang, Dongyeop},
  year         = {2026},
  eprint       = {ARXIV_ID_PLACEHOLDER},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url          = {https://arxiv.org/abs/ARXIV_ID_PLACEHOLDER}
}