Anthropic's models surprisingly performed as well as or better in other harnesses on benchmarks; Armin's blog post hints that this might change in the near future
I had some vibes that Opus 4.8 was performing worse than older ones for some uses that are off-distribution, and now I have the receipts. The latest Opus/Sonnet are causing tool invocation failures on Pi's edit tool when older ones did not! I wrote about it: https://lucumr.pocoo.org/2026/7/4/better-models-worse-tools/






