Founder Benchmarks Anthropic And OpenAI Models For Security Reviews

David Cramer (@zeeg)

Here's a baseline quick bench between Anthropic and OpenAI models which I'm now content with.

This test runs Warden's stock security-review skill across some known Sentry priors. These are security findings (excluding defense-in-depth) of various levels.

I can't explain the changes in Opus 4.7/4.8, but what I can tell you is: it's simply not a good product. If a model update breaks usage of a model, I'll use a different model.

I'll likely be exploring switching our implementation to run GPT 5.5 (low) for PR review, as that's the right cost tradeoff we're looking for.

Generally speaking, the way you can read this is simple:

Higher % of known is desirable. Higher total findings is a secondary value add, but cost is the tradeoff.

More findings != strictly more value. It likely means the model found more lower-severity (or in this case, defense-in-depth-style) issues that you may not even care to address.

In general, consider this just an interesting unscientific experiment, one that is anecdotally important but is not aiming to be a true benchmark.
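One way to apply that reading rule is to rank runs by the share of known findings recovered, then break ties on cost per known finding, treating total findings as informational only. The sketch below is a minimal illustration under stated assumptions: the `RunResult` fields, the helper names, and every number in it are hypothetical placeholders, not Warden's data model and not results from this benchmark.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One model run. All fields are hypothetical, not Warden's schema."""
    model: str
    known_total: int     # known prior findings in the test set
    known_found: int     # how many of those the model reported
    total_findings: int  # everything the model reported
    cost_usd: float      # spend for the run


def known_rate(r: RunResult) -> float:
    """Share of known findings recovered (higher is better)."""
    return r.known_found / r.known_total if r.known_total else 0.0


def cost_per_known(r: RunResult) -> float:
    """Dollars spent per known finding recovered (lower is better)."""
    return r.cost_usd / r.known_found if r.known_found else float("inf")


def rank(results: list[RunResult]) -> list[RunResult]:
    """Sort by known rate first, then by cost per known finding."""
    return sorted(results, key=lambda r: (-known_rate(r), cost_per_known(r)))


# Placeholder numbers for illustration only.
runs = [
    RunResult("model-a", known_total=10, known_found=8, total_findings=20, cost_usd=4.0),
    RunResult("model-b", known_total=10, known_found=8, total_findings=35, cost_usd=12.0),
    RunResult("model-c", known_total=10, known_found=5, total_findings=40, cost_usd=2.0),
]

for r in rank(runs):
    print(f"{r.model}: {known_rate(r):.0%} known, "
          f"${cost_per_known(r):.2f} per known finding, "
          f"{r.total_findings} total findings")
```

In this made-up data, model-c reports the most findings but ranks last because it recovers the fewest known ones, which mirrors the point that more findings are not automatically more value.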

Read more in the Warden docs: https://warden.sentry.dev/benchmarking

p.s. if you find something wrong in the data please let me know

12:14 AM · Jun 4, 2026 · 5K Views

Sentiment

Positive commenters praise specific GPT variants as the best low-cost option for finding bugs in security reviews, while negative commenters allege deliberate capability limits and call Claude SDK performance shocking.

Positive: 50.0%

Negative: 50.0%

6 comments with sentiment.
