
Founder Benchmarks Anthropic And OpenAI Models For Security Reviews

Original post: Ben (no treats) #982

Here's a baseline quick bench between Anthropic and OpenAI models which I'm now content with.

This test runs Warden's stock security-review skill across some known Sentry priors. These are security findings (excluding defense-in-depth) of various levels.

I can't explain the changes in Opus 4.7/4.8, but what I can tell you is: it's simply not a good product. If a model update breaks usage of a model, I'll use a different model.

I'll likely be exploring switching our implementation to run GPT 5.5 (low) for PR review, as that's the right cost tradeoff we're looking for.

Generally speaking, the way you can read this is simple:

A higher % of known findings is desirable. A higher total of findings is a secondary value-add, but cost is the tradeoff.

More findings != exclusively more value. It likely means the model found more lower-sev (or, in this case, defense-in-depth-style) findings that you may not even care to address.
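
To make that reading concrete, here is a minimal sketch of how the three signals (% of known, total findings, cost) could be compared side by side. Everything in it is an assumption for illustration: the `RunResult` type, its field names, the ranking rule, and the numbers are all made up, and none of it is taken from Warden or from the benchmark data.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One model's run over a set of known findings (hypothetical shape)."""
    model: str
    known_total: int     # known findings in the test set
    known_found: int     # how many of those the model reported
    total_findings: int  # everything the model reported, known or not
    cost_usd: float      # total spend for the run

    @property
    def known_pct(self) -> float:
        # Primary signal: share of known findings the model recovered.
        return 100.0 * self.known_found / self.known_total

    @property
    def extra_findings(self) -> int:
        # Secondary signal: findings beyond the known set (often lower sev).
        return self.total_findings - self.known_found

    @property
    def cost_per_known(self) -> float:
        # The tradeoff: what each recovered known finding cost.
        if self.known_found == 0:
            return float("inf")
        return self.cost_usd / self.known_found


def rank(results):
    # Assumed rule: higher % of known first, then lower cost as the tie-break.
    return sorted(results, key=lambda r: (-r.known_pct, r.cost_usd))


# Made-up numbers for illustration only; not taken from the benchmark.
runs = [
    RunResult("model-a", known_total=10, known_found=8, total_findings=20, cost_usd=4.00),
    RunResult("model-b", known_total=10, known_found=8, total_findings=35, cost_usd=9.00),
    RunResult("model-c", known_total=10, known_found=5, total_findings=40, cost_usd=6.00),
]

for r in rank(runs):
    print(
        f"{r.model}: {r.known_pct:.0f}% known, "
        f"{r.extra_findings} extra findings, "
        f"${r.cost_per_known:.2f} per known finding"
    )
```

In this toy data, model-c reports the most findings overall but ranks last, because most of what it reports falls outside the known set; that is the "more findings != more value" point in miniature.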

In general, consider this just an interesting unscientific experiment, one that anecdotally is important but is not aiming to be a true benchmark.

Read more in the Warden docs: https://warden.sentry.dev/benchmarking

p.s. if you find something wrong in the data please let me know

12:14 AM · Jun 4, 2026 · 5K Views