I disagree with the overall sentiment of this HN user, but I have seen agents write terrible spaghetti code. That's exactly what CodeClash, led by @jyangballin & @KLieret, evaluates: an agent needs to maintain a codebase while facing an adversarial opponent multiple times. We frequently see the failure cases mentioned here in the CodeClash trajectories.
Together AI's Ashwinee Panda proposes a deletion-focused benchmark to stop AI coding agents from generating bloated code
The benchmark aims to curb additive bias in auto-RL workflows.
Supportive replies praise the CodeClash benchmark for AI coding agents and call AI coding amazing with room for further gains, while a critical reply notes that most agents still cannot fix their own bugs, let alone hold up under adversarial pressure.

@jyangballin @KLieret AI coding is getting so much better all the time but there are still some facets of it that we don't have great benchmarks for. Once we do, we'll improve on those aspects as well.

@OfirPress @jyangballin @KLieret this is just true, though: agents _do_ prefer to add and never delete and never reuse. wdym you disagree?

@OfirPress @jyangballin @KLieret i think every individual sentence in the post is true but my overall sentiment is the opposite (AI coding is amazing)

@PandaAshwinee @jyangballin @KLieret I disagree with the notion (mentioned in the title) that "AI coding is a nightmare".

@OfirPress @jyangballin @KLieret having an adversary actively breaking your code changes the game entirely. most agents can't even fix their own bugs, let alone fight back

@jyangballin @KLieret https://codeclash.ai/

@0xV0LYX @jyangballin @KLieret In CodeClash the adversary doesn't touch your code; they battle you in a code-based arena like RobotRumble. So it's tough, but not as tough as having someone manipulate your code.

@PandaAshwinee @jyangballin @KLieret Ya AI coding is amazing, and there's still a lot we can further improve

@OfirPress @jyangballin @KLieret where do you hit the wall? holding bigger code, or when changes ripple across multiple files?

@OfirPress @jyangballin @KLieret this is cool, is the idea basically to force the codebase to "make contact with reality" more?

@OfirPress @jyangballin @KLieret Agents write spaghetti code the same way anyone does: shortcuts work until they're tested. CodeClash runs the test early.