A benchmark by Stanford NLP's Chenglei Si finds Claude-Fable-5 leads on autoresearch, while open-weight Kimi-K2.7-Code tops ML engineering

Claude-Fable-5 maintained its overall lead under cost constraints.

Original post

Zhengyao Jiang@zhengyaojiang

We benchmarked 7 frontier models on 3 categories of autoresearch tasks: ML engineering, harness/prompt engineering, and algorithmic discovery.

Fable-5 won overall even under cost constraint, but on ML engineering, the open model Kimi-K2.7-Code surpassed frontier models.🧵(1/5)

10:36 AM · Jun 14, 2026 · 57.1K Views

Sentiment

Many users called the Fable-5 and Kimi-K2.7-Code benchmark results impressive because they highlight strong open-model performance on autoresearch and ML engineering tasks.

Pos: 87.5% · Neg: 12.5% (9 comments with sentiment)

Posts from X

Most Activity

Views 5.8K · Bookmarks 9 · Likes 56

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Really, really strange set of results throughout. I can believe that Fable was nerfed and Kimi K2.7 is relatively great at ML engineering, but they also get Gemini 3.1 at/near the top on many tasks.

Zhengyao Jiang@zhengyaojiang

Surprisingly though, we found that a recent open model, Kimi-K2.7, performed very well on ML engineering. And Fable performed even worse than Opus. This could be either because of the inflated cost, or the guardrails put on ML tasks. (4/5)

2d5.8K569

Retweets 2 · Replies 3

Zhengyao Jiang@zhengyaojiang

2d2.9K429

Zhengyao Jiang@zhengyaojiang

Overall, it seems like the model supply chain will be less stable in the autoresearch space.

On the Weco side, we’ll stay model-neutral and provide more options for our users. Today, we just added support for Kimi-2.7 (5/5)

If you’re interested: https://www.weco.ai/

2d1.8K166

Zhengyao Jiang@zhengyaojiang

Overall, Fable is a really strong model for autoresearch. It dominates harness/prompt engineering and algorithmic discovery tasks. We were especially surprised by the algorithmic discovery results, because the eval cost is low and cheaper models can run many more steps. (3/5)

2d2.7K292

Zhengyao Jiang@zhengyaojiang

@sanmking @OpenRouter Yes it's possible I believe @SakanaAILabs had some research on this https://sakana.ai/ab-mcts/

2d51973

Zhengyao Jiang@zhengyaojiang

Benchmark protocol:
- Cost (LLM + eval cost) constrained, not steps constrained. This means an agent can run more steps if its model or solutions are cheaper to run.
- All of the models use the autonomous research harness behind the @WecoAI service.
- The scores should be interpreted as how good the final solution is compared to a naive ReAct agent. (2/5)

2d3K19
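A minimal sketch of what a cost-constrained (rather than step-constrained) run could look like, based on the protocol described above. The function name `run_cost_bounded`, the fixed per-step costs in cents, and the `toy_score` function are all hypothetical; the actual Weco harness and its cost accounting are not shown in the thread.

```python
from typing import Callable, Tuple


def run_cost_bounded(
    budget_cents: int,
    llm_cents: int,
    eval_cents: int,
    propose: Callable[[int], float],
) -> Tuple[int, float]:
    """Run steps until the next one would exceed the budget.

    Each step is assumed to cost a fixed llm_cents + eval_cents
    (an illustrative simplification). Returns (steps taken, best score).
    """
    per_step = llm_cents + eval_cents
    if per_step <= 0:
        raise ValueError("per-step cost must be positive")

    spent, steps, best = 0, 0, float("-inf")
    while spent + per_step <= budget_cents:
        best = max(best, propose(steps))  # keep the best solution so far
        spent += per_step
        steps += 1
    return steps, best


def toy_score(step: int) -> float:
    """Placeholder scorer, not a real benchmark task."""
    return float(step % 7)


# Same budget, different per-step prices (made-up numbers).
expensive_steps, _ = run_cost_bounded(1000, 40, 10, toy_score)
cheap_steps, _ = run_cost_bounded(1000, 10, 10, toy_score)
print(expensive_steps, cheap_steps)
```

With these made-up prices the cheaper configuration gets 50 steps against 20 for the expensive one. More steps does not by itself mean a better final score.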

Santiago M.@sanmking

@zhengyaojiang Did you do an analysis of the similarity of answers or scores across problems? I'm especially curious given the recent results from the @OpenRouter Fusion API.

Maybe a bag of models would perform better, especially in a verifiable domain such as autoresearch.

2d1.4K41
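One way to read the "bag of models" idea above: when every solution can be scored by a verifier, keep the best-scoring solution per task across models. The sketch below is illustrative only. The model names, task names, and scores are invented placeholders, not benchmark results, and the helper `best_per_task` is hypothetical. It also ignores the extra cost of running several models, which matters under a cost-constrained protocol.

```python
from typing import Dict


def best_per_task(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """For each task, take the highest verified score across all models.

    scores maps model name -> {task name -> score}. Every model is
    assumed to have been run on the same set of tasks.
    """
    tasks = next(iter(scores.values())).keys()
    return {task: max(per_model[task] for per_model in scores.values())
            for task in tasks}


# Placeholder numbers, made up purely for illustration.
placeholder_scores = {
    "model_a": {"task_1": 0.6, "task_2": 0.2},
    "model_b": {"task_1": 0.3, "task_2": 0.7},
}

bag = best_per_task(placeholder_scores)
print(bag)
```

The bag only beats the best single model when the models' strengths differ across tasks, which is why the question about similarity of scores across problems is the relevant one.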

Zhengyao Jiang@zhengyaojiang

@alokbishoyi97 yeah it's not quite good at heuristic engineering & more conventional algorithm design

2d54821

Zhengyao Jiang@zhengyaojiang

On MLE, the Opus vs. GPT-5.5 gap is very small, so I wouldn’t read too much into it (can be noise).

On harness tuning, that’s a fair concern. It has been tuned for models from different providers throughout the development process. It’s possible there’s some bias here but it’s hard to tell which specific model gets an advantage.

2d56711

Zhengyao Jiang@zhengyaojiang

Thanks Davide! It is quite noisy, but we ran a lot of seeds, so the aggregated number should be rather robust.

I think a key difference between our benchmark and others is that we’re cost-bound. Claude models are generally quite expensive, which leads to fewer iteration steps. Also, Claude is somewhat weaker at some niche tasks that differ from conventional software engineering. For example, it was quite bad at MLE until Opus 4.6, and is still bad at algorithmic/heuristic engineering.

2d23811

Alok Bishoyi@alokbishoyi97

@zhengyaojiang Wow, opus 4-8 so down the list!?

2d6833

Mario Filho@mariofilhoml

@zhengyaojiang Quite interesting, thanks! Can you share an example of ML engineering task?

I’m curious if it’s tradML, deep learning fine tuning or pipeline, etc

2d4312

Burny - Effective Curiosity@burny_tech

@zhengyaojiang Super interesting

2d3982

Morgan McGuire@morgymcg

@zhengyaojiang interesting, the vibes with opus 4.7 for ml engineering seemed better than 5.5 to me - just vibes tho i guess. reckon your harness is tuned to 1 vs the other?

2d6901

Zhengyao Jiang@zhengyaojiang

@TheGrizztronic yes similar as in: https://arxiv.org/html/2605.21384v1

2d3571

Davide Paglieri@PaglieriDavide

@zhengyaojiang Very cool stuff! Surprised to see that despite other benchmarks, Opus 4.8 is lagging the rest in this. How replicable is this/how noisy?

2d3001