Here's a baseline quick bench between Anthropic and OpenAI models which I'm now content with.
This test runs Warden's stock security-review skill across some known Sentry priors. These are security findings (excluding defense-in-depth) of various severity levels.
I can't explain the changes in Opus 4.7/4.8, but what I can tell you is: it's simply not a good product. If a model update breaks usage of a model, I'll use a different model.
I'll likely be exploring switching our implementation to run GPT 5.5 (low) for PR review, as that's the right cost tradeoff we're looking for.
Generally speaking, the way you can read this is simple:
Higher % of known is desirable. Higher total findings is a secondary value add, but cost is the tradeoff.
More findings != exclusively more value. It likely means it found more lower-sev (or in this case, defense-in-depth-style) issues that you may not even care to address.
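To make that reading concrete, here's a rough sketch of the arithmetic. This is not Warden's actual scoring code: the `summarize` function, the finding IDs, and the cost figure are all made up for illustration, and it assumes findings can be matched to known priors by a simple ID.

```python
from typing import Optional


def summarize(known: set[str], reported: set[str], cost_usd: float) -> dict:
    """Hypothetical scoring helper, not part of Warden.

    known:    IDs of the known prior findings (the ground truth set)
    reported: IDs of everything the model reported in the run
    cost_usd: what the run cost (illustrative input)
    """
    matched = known & reported  # known priors the model actually found
    extra = reported - known    # everything else (often lower sev)

    known_pct = 100 * len(matched) / len(known) if known else 0.0
    cost_per_known: Optional[float] = (
        cost_usd / len(matched) if matched else None
    )

    return {
        "known_found_pct": known_pct,          # primary signal: higher is better
        "total_findings": len(reported),       # secondary signal
        "extra_findings": len(extra),          # not automatically valuable
        "cost_per_known_usd": cost_per_known,  # the tradeoff
    }


# Made-up example: 4 known priors, the model reports 3 findings, 2 of them known.
result = summarize({"a", "b", "c", "d"}, {"a", "b", "x"}, cost_usd=1.0)
print(result)
```

A run that reports lots of extras but a low share of known priors would score worse here than a cheaper run that hits most of the known ones, which is the tradeoff I care about.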
In general, consider this just an interesting unscientific experiment, one that is anecdotally important but not aiming to be a true benchmark.
Read more in the Warden docs: https://warden.sentry.dev/benchmarking
p.s. if you find something wrong in the data please let me know