FrontierCS releases FrontierSmith for open-ended coding data


There is a very interesting idea in this paper: how do you judge whether an optimization problem created by an LLM is ‘interesting’ or ‘valuable’? The proposed measure is called *idea divergence*: ask LLMs to solve the task multiple times and measure how many different strategies are used and perform well. We could not measure such solution diversity objectively before LLMs, but now we can easily get it with prompting.

Qiuyang Mang@MangQiuyang

Open-ended coding training data may no longer be the bottleneck: AI can scale open-ended tasks—and even outperform human-expert curation.

FrontierCS team is releasing FrontierSmith: a system for synthesizing open-ended coding problems at scale. Starting from closed-ended coding tasks, FrontierSmith mutates, filters, and builds runnable optimization environments for long-horizon coding agents. In our experiments, FrontierSmith data trains stronger models than human-curated open-ended data on FrontierCS and ALE-bench.

Blog: https://frontier-cs.org/blog/frontiersmith/ Paper: https://arxiv.org/abs/2605.14445 Code: https://github.com/FrontierCS/FrontierSmith Model: https://huggingface.co/runyuanhe/qwen35-9b-frontiersmith


Ofir Press@OfirPress

Love the name :)


Qiuyang Mang@MangQiuyang

Huge thanks to all collaborators: @RunyuanHe @Zhoushang @KaiyuanLiu04 @lihanc02 @HuanzhiMao @qizhengz_alex Zerui Li @Builtin_Pb Lufeng Cheng @YichuanM @shangjingbo @AlexGDimakis @profjoeyg @alvinkcheung


Qiuyang Mang@MangQiuyang

Code agents are getting very good at repetitive software work. Across research labs and startups, the next important question is increasingly about whether AI can solve open-ended optimization problems that matter in the real world: chip placement and routing, logistics, power-grid scheduling, database tuning, kernel optimization, and many others.

But our previous FrontierCS on Harbor blog (https://frontier-cs.org/blog/harbor/) showed a clear weakness. Today's code agents are much less reliable in long-horizon, open-ended optimization than they are on traditional contest or math-style tasks.


Qiuyang Mang@MangQiuyang

We think a major reason is data.

Classic RLVR settings have huge amounts of high-quality training data. Competitive programming alone has more than 100,000 public problems, and the broader coding-data industry is continuously producing more. By contrast, if we add together open-ended optimization benchmarks such as FrontierCS, ALE-bench, KernelBench, and the recent MLS-Bench, we still only get hundreds of tasks.

That gap is the bottleneck FrontierSmith targets. Frontier labs may already understand the value of open-ended optimization, but without enough scalable training tasks, it is hard to run the kind of training that made closed-ended coding models so strong.


Kexun Zhang@kexun_zhang

@MangQiuyang lol thanks for featuring hardtests in your video haha


Qiuyang Mang@MangQiuyang

The core idea of FrontierSmith is simple: do not ask an LLM to invent high-quality open-ended problems from scratch. Start from closed-ended problems instead.

Closed-ended coding tasks are already abundant. Given a LeetCode-style or competitive-programming-style problem, FrontierSmith applies principled mutations that turn it into a high-quality open-ended optimization problem.


Qiuyang Mang@MangQiuyang

The surviving ideas are converted into clean, runnable training environments. We reuse the FrontierCS judge sandbox and generate two pieces for each task.

We evaluate on two open-ended coding benchmarks: FrontierCS, using the 172 algorithmic open-ended tasks, and ALE-bench-lite, derived from AtCoder Heuristic Contest-style optimization tasks.

For training, we synthesize 200 FrontierSmith problems and run GRPO on Qwen3.5-9B and Qwen3.5-27B. We compare against several controls: training on human-curated FrontierCS problems, training on ALE-bench, training directly on 200 closed-ended HardTests problems, and training on FrontierCS with random rewards. The results are direct. FrontierSmith-generated data is strong enough to match or exceed human-curated open-ended training data.
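For readers unfamiliar with GRPO, here is a minimal sketch of its group-relative advantage step, which is what lets a continuous optimization score act as a training signal. The thread does not say how FrontierSmith shapes or normalizes rewards, so the function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are all assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of rollouts for the same task.

    rewards: scores for several sampled solutions to a single problem
    (hypothetical values; the real reward definition is not given here).
    Each advantage is the reward's distance from the group mean, divided
    by the group's standard deviation (plus eps to avoid dividing by zero).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts on one open-ended task, scored by its judge.
print(group_relative_advantages([0.2, 0.4, 0.6, 0.8]))
```

When every rollout in a group gets the same score, all advantages are zero and the group contributes no gradient, which is one reason a task where solutions actually differ in score is more useful for this kind of training.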


Qiuyang Mang@MangQiuyang

Mutation creates many candidates, but not every candidate is useful. Some are still effectively closed-ended. Some are open-ended in wording but dominated by one obvious strategy.

Our key filtering signal is idea divergence. We cannot ask an LLM to prove whether a problem is P or NP-hard, or whether an optimum is reachable under a fixed compute budget. We can, however, sample solutions from different solvers and ask whether they explore meaningfully different algorithmic ideas.

Open-ended problems tend to produce diverse solution strategies. Closed-ended problems are often dominated by a single "gold idea."
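A rough sketch of how such a signal could be computed is shown below. This is not the paper's metric: it assumes each sampled solution has already been tagged with a short idea label (for example, by prompting an LLM) and scored by the task's judge. The function name `idea_divergence`, the `min_score` cutoff, and the entropy normalization are all hypothetical choices.

```python
import math
from collections import Counter


def idea_divergence(samples, min_score):
    """Return (distinct_good_ideas, normalized_entropy) for sampled solutions.

    samples: list of (idea_label, score) pairs. Labels and scores are
    assumed inputs; how they are produced is outside this sketch.
    Only solutions scoring at least min_score are counted, so an idea
    must also perform well to contribute to the divergence.
    """
    good = [label for label, score in samples if score >= min_score]
    counts = Counter(good)
    k = len(counts)
    if k <= 1:
        # Zero or one viable idea: the task behaves like a closed-ended one.
        return k, 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Divide by log(k) so the result lies in [0, 1].
    return k, entropy / math.log(k)


closed_like = [("dp", 1.0), ("dp", 1.0), ("dp", 1.0), ("dp", 1.0)]
open_like = [("greedy", 0.8), ("annealing", 0.9), ("beam", 0.85), ("greedy", 0.2)]
print(idea_divergence(closed_like, min_score=0.5))
print(idea_divergence(open_like, min_score=0.5))
```

Under these assumptions, a candidate dominated by one "gold idea" gets a divergence of zero and would be filtered out, while a candidate with several well-performing strategies gets a value near one.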


Qiuyang Mang@MangQiuyang

@kexun_zhang yes! Everyone should check out HardTests if they want their research to benefit from large-scale, high-quality competitive programming data.


Hanchen Li@lihanc02

@MangQiuyang @RunyuanHe @ZhouShang @KaiyuanLiu04 @HuanzhiMao @qizhengz_alex @Builtin_Pb @YichuanM @shangjingbo @AlexGDimakis @profjoeyg @alvinkcheung Where is the real smith image?
