AI Coding Tools Plateau After 4.0 Jump, Benchmarks Mislead

this has been my experience as well. there definitely were improvements, specifically wrt shell-based computer use, but also regressions, especially in the last 3 version bumps from flicker and gerpertee corp. the last step change was 3.x to 4.x in flicker land, probably mostly due to them getting all coding sessions from april to october 2025 via CC. similar timeline with GPT and Codex. at least in my line of work, no big jumps after that. the benchmark increases mean literally nothing in the real world. i suppose we have a data problem now. there's only so much you can RL into those damn things. and with ralph loops/swarms/agents reviewing agents/whatever, you get less and less human signal to improve RL, would be my uneducated guess. i'd also guess it's very hard to capture design/system thinking in RL. all that said: if we are at the top of the S curve now, then i'll take what we got. plenty useful, even if it won't replace me fully or even partially anytime soon.

11:32 AM · May 23, 2026