Ramp launches Ramp SWE-Bench, a private coding benchmark built from real engineering challenges to evaluate AI on unseen tasks

Eric Glyman says frontier models train on public benchmarks.

Original post

Ramp Labs@RampLabs

Today we’re releasing Ramp SWE-Bench: a private, production-grounded coding benchmark created from real engineering problems we've faced at Ramp.

10:26 AM · Jun 12, 2026 · 85.1K Views

Sentiment

Many users praised Ramp's private SWE-Bench release for grounding AI coding benchmarks in real engineering problems, while some called it too easy, already saturated, or classic bubble behavior.

Positive: 85.7%

Negative: 14.3%

22 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 16.1K · Bookmarks: 21 · Likes: 111 · Retweets: 4 · Replies: 10

Eric Glyman@eglyman

public benchmarks are saturated. every frontier model has trained against them, and the leaderboard tells you near nothing.

we built ours from inside ramp — code no model has seen, graded against the bar our engineers ship to.

every company running on AI needs its own.

Ramp Labs@RampLabs

When measuring effectiveness versus cost, the frontier presents as a tradeoff rather than a single winner.

Read our methodology and explore the results below: http://labs.ramp.com/swebench
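
The cost-versus-effectiveness tradeoff described above can be illustrated with a Pareto frontier: the set of models that no other model beats on both cost and pass rate at once. The following is a minimal sketch, not Ramp's methodology. The `Result` type, the model names, and all numbers are hypothetical placeholders, not figures from the benchmark.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str        # hypothetical model label
    cost_usd: float   # hypothetical average cost per task
    pass_rate: float  # hypothetical fraction of tasks solved


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep results that no other result beats on both cost and pass rate."""
    frontier = []
    best_pass_rate = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, try the
    # higher pass rate first. A result stays only if it solves more
    # tasks than every cheaper result seen so far.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.pass_rate)):
        if r.pass_rate > best_pass_rate:
            frontier.append(r)
            best_pass_rate = r.pass_rate
    return frontier


# Made-up data for illustration only.
results = [
    Result("model-a", 0.10, 0.40),
    Result("model-b", 0.50, 0.62),
    Result("model-c", 0.60, 0.55),  # costs more than model-b, solves less
    Result("model-d", 2.00, 0.70),
]
frontier_models = [r.model for r in pareto_frontier(results)]
print(frontier_models)
```

With this made-up data, three models remain on the frontier, each a different cost and quality tradeoff, which is why such a comparison yields a curve rather than a single winner.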

Ramp Labs@RampLabs

Public benchmarks saturate quickly and inevitably leak into training data, with none quite resembling the work our engineers do every day. Building our own benchmark has allowed us to evaluate models within our own financial software ecosystem.

We compared models side by side and unearthed their behavioral differences. Head to head breakdowns available here: http://labs.ramp.com/swebench

Ofir Press@OfirPress

@eglyman http://programbench.com http://codeclash.ai http://algotune.io http://swefficiency.com http://critpt.com http://videogamebench.com i think we have a few unsaturated public benchmarks let me know if you need more

Ofir Press@OfirPress

Ramp has a cool post about their new internal SWE-bench and they use mini-swe-agent as the harness powering the eval. DeepSWE also used mini-swe-agent as the harness, and showed that it performs just as well as Claude Code and Codex on the tasks.

Sergey Karayev@sergeykarayev

@RampLabs Great work! And if anyone would like to make their own personal SWE-Bench like this, based on real merged PRs into their codebase, we got you!