@scaling01 i think being prescriptive about mechanical cleanliness in an eval intended for AI agents is bad
AI agents don't operate with the same cleanliness / code quality concerns as humans; their capability profile is totally different
Opus 4.8 is the best coding model out there
FrontierCode by Cognition is probably the highest quality coding benchmark we have seen so far
it moves beyond unit tests alone for scoring: it also checks regression safety, mechanical cleanliness, test correctness, scope, and code quality
20+ open-source developers handcrafted 150 tasks, each of which took over 40 hours to construct
it also covers a more diverse set of programming languages