Tech · 10h ago

Mechanize Inc. CTO Ege Erdil argues that "mechanical cleanliness" metrics are poorly suited for AI coding agent evaluations

Lisan al Gaib agreed that such automated checks will eventually lose relevance, once LLMs write code better than humans do.

Original post

Ege Erdil@EgeErdil2

@scaling01 i think being prescriptive about mechanical cleanliness in an eval intended for AI agents is bad

AI agents don't operate with the same cleanliness / code quality concerns as humans, their capability profile is totally different

Lisan al Gaib@scaling01

Opus 4.8 is the best coding model out there

FrontierCode by Cognition is probably the highest quality coding benchmark we have seen so far

it moves beyond just using unit-testing for scoring, it also tests for regression safety, mechanical cleanliness, test correctness, scope and code quality

20+ open-source developers handcrafted 150 tasks, each of which took over 40 hours to construct

it also tests a more diverse set of programming languages

3:52 AM · Jun 9, 2026 · 908 Views

Sentiment

Some users dismissed the FrontierCode Benchmark's mechanical cleanliness metric as nonsensical and irrelevant to real performance.

Pos: 0.0%

Neg: 100.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 687 · Bookmarks: 1 · Likes: 4 · Replies: 1

Ege Erdil@EgeErdil2

@scaling01 i think it makes about as much sense as grading a compiler's outputs of machine code on "mechanical cleanliness" from the POV of an asm programmer

you'd never do that because it makes no sense. you'd focus on correctness, performance, memory usage, binary sizes, etc.

Lisan al Gaib@scaling01

@EgeErdil2 Mechanical cleanliness just means that it passes building, linting and style checks. This is separate from code quality.

But I agree that in the limit this doesn't make much sense once LLMs are better than humans at writing code
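
One way to read the definition in the reply above: mechanical cleanliness is a pass/fail gate over tooling exit codes, separate from whether the code is correct or well designed. A minimal Python sketch of that reading follows. FrontierCode's actual scoring harness is not described in this thread, so the names `run_check` and `mechanically_clean` are illustrative, and the commands are hypothetical placeholders standing in for a real build, linter, and formatter.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool


def run_check(name, command):
    """Run one command; the check passes if its exit code is 0."""
    completed = subprocess.run(command, capture_output=True)
    return CheckResult(name=name, passed=completed.returncode == 0)


def mechanically_clean(results):
    """All-or-nothing gate: every build/lint/style check must pass."""
    return all(r.passed for r in results)


# Hypothetical placeholder commands; a real setup would invoke the
# project's own build tool, linter, and style checker here.
checks = {
    "build": [sys.executable, "-c", "import sys; sys.exit(0)"],
    "lint": [sys.executable, "-c", "import sys; sys.exit(0)"],
    "style": [sys.executable, "-c", "import sys; sys.exit(1)"],
}

results = [run_check(name, cmd) for name, cmd in checks.items()]
print(mechanically_clean(results))
```

Under this reading the gate says nothing about correctness, performance, or memory usage. Those would need separate measurements.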
