Developer Finds Opus 4.8 More Reliable Than GPT 5.5 For Complex Coding
So I've been using GPT 5.5 and Opus 4.8 for the same tasks basically 24/7 since launch and, at least for me, I'm confident that every single time, Opus was superior, and in a way you can only notice if you know what you're doing. One example (of dozens!):
"implement push-pop fusion on HVM4's evaluation loop, aiming for a 20% performance increase"
After several minutes:
- Opus 4.8 reported it did everything it could but couldn't achieve the goal, and that the performance gain of this change is 7%.
- GPT 5.5 succeeded! Its code WAS 20% faster. Yet, upon inspection, it implemented 2 unrelated changes that broke HVM's semantics!
That was my experience with both, 9/10 times. If I hadn't investigated, I'd have been disappointed with Opus and used GPT's code, merging a clear regression. Over time, my codebase would accumulate damage. This happened to Bend2! Opus, on the other hand, was honest, and that negative signal gave me valuable information that pushed things *forward*. I then asked it to try a different thing, and THAT new thing resulted in a legit 25% speedup. That kind of interaction rarely happens with GPT 5.5, in my experience.
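To make the "upon inspection" step concrete: one way to catch a speedup that was bought by breaking semantics is to check output equivalence against the old implementation before looking at timings at all. Below is a minimal Python sketch of that idea, not the actual HVM4 workflow. `review_optimization`, `median_seconds`, and the toy functions are hypothetical names made up for illustration, and a real check would need a far broader input set than this.

```python
import time
from statistics import median
from typing import Any, Callable, Dict, Sequence


def median_seconds(fn: Callable[[Any], Any], inputs: Sequence[Any], repeats: int = 5) -> float:
    """Median wall-clock time of running fn over all inputs, across several repeats."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            fn(x)
        samples.append(time.perf_counter() - start)
    return median(samples)


def review_optimization(
    baseline: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    inputs: Sequence[Any],
    target_speedup: float = 1.20,  # hypothetical default mirroring the "20%" goal
) -> Dict[str, Any]:
    """Accept a candidate only if it matches the baseline AND meets the speed target."""
    # Correctness gate first: any differing output means semantics changed,
    # so the measured speedup is irrelevant and is not even computed.
    mismatches = [x for x in inputs if baseline(x) != candidate(x)]
    if mismatches:
        return {"accepted": False, "reason": "semantics changed",
                "mismatches": mismatches, "speedup": None}

    # Only now compare timings (guard against a zero denominator).
    speedup = median_seconds(baseline, inputs) / max(median_seconds(candidate, inputs), 1e-12)
    if speedup < target_speedup:
        return {"accepted": False, "reason": "target not met",
                "mismatches": [], "speedup": speedup}
    return {"accepted": True, "reason": "ok", "mismatches": [], "speedup": speedup}


# Toy stand-ins: a slow reference and a "fast" version that is wrong on large inputs.
def reference_sum(n: int) -> int:
    return sum(range(n))


def cheating_sum(n: int) -> int:
    return n * (n - 1) // 2 if n < 1000 else 0  # silently wrong for n >= 1000


report = review_optimization(reference_sum, cheating_sum, [10, 100, 5000])
print(report["reason"])  # semantics changed
```

An honest "couldn't hit the target, got 7%" corresponds to the `target not met` branch, which still reports a trustworthy number; the dangerous case is the one the first gate rejects.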
(I'm not too happy about this post because I'd rather not support a company that gatekeeps intelligence, especially in the context of safeguarding against exploits. Also, your mileage may vary. But I know many follow me for my honest observations and, in >>my<< experience, Opus 4.8 is, without doubt, the most reliable model for work right now.)
(This also may sound a bit contradictory because I often praised GPT as trustworthy, but I'm talking about different things here. GPT is careful, meaning it won't leave things half done: it will cover edge cases, test thoroughly, double-check everything. In that sense, it is more honest. But it will cheat by malicious compliance. It feels like it is actively trying to game your rules and find loopholes to screw you. I don't feel like that with Opus at all.)
Note Opus IS still a bit dumber than GPT. It takes longer to grasp a concept. But eventually it does. The more you talk to it, the smarter it gets. GPT is smarter out of the box, but less flexible and less apt to learn new things. Most importantly, though, Opus excels at everything that matters for productivity, including communication, doing exactly what you asked, code style, not breaking unrelated things, and, above all, HONESTY. I can't overstate how important all these are.
I'm using 4.8 to do a big pass through the whole Bend2 codebase, cleaning up a lot of junk left by 5.5, and things couldn't be going better. I made an incredible amount of REAL (manually verified, not trusted...) progress since its launch!
And again, this is one of, I swear, dozens of interactions where this happened. Working with GPT 5.5 directly feels adversarial and I'm tired of it. It is incredible for gathering insights in full isolation, but NOT as the main productivity driver, and not as a bot I talk to.

All things I like about Opus are invisible to benchmarks:
- clarity of communication
- not doing extra stuff that leaves collateral damage
- less reward hacking (benches measure the opposite)
- how it scales over long conversations
- not lying about accomplishments