Today we’re releasing Ramp SWE-Bench: a private, production-grounded coding benchmark created from real engineering problems we've faced at Ramp.
Ramp launches Ramp SWE-Bench, a private coding benchmark built from real engineering challenges to evaluate AI on unseen tasks
Eric Glyman says frontier models train on public benchmarks.
Many users praised Ramp's private SWE-Bench release for grounding AI coding benchmarks in real engineering problems, while some called it too easy, already saturated, or classic bubble behavior.
public benchmarks are saturated. every frontier model has trained against them, and the leaderboard tells you near nothing.
we built ours from inside ramp — code no model has seen, graded against the bar our engineers ship to.
every company running on AI needs its own.

When measuring effectiveness versus cost, the frontier presents as a tradeoff rather than a single winner.
Read our methodology and explore the results below: http://labs.ramp.com/swebench
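
A minimal sketch of what "a tradeoff rather than a single winner" can mean in practice, assuming each model is summarized by an average cost per task and a pass rate. The model names and numbers are hypothetical placeholders, not Ramp's results, and `Result` and `pareto_frontier` are illustrative names rather than anything taken from Ramp's methodology.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str        # hypothetical model label
    cost_usd: float   # assumed average cost per task, in USD
    pass_rate: float  # assumed fraction of tasks solved, 0.0 to 1.0


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the results that no other result beats on both cost and pass rate."""
    frontier: list[Result] = []
    best_pass_rate = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, try the stronger model first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.pass_rate)):
        # A more expensive model earns a place only if it solves strictly more tasks.
        if r.pass_rate > best_pass_rate:
            frontier.append(r)
            best_pass_rate = r.pass_rate
    return frontier


# Placeholder data for illustration only.
RESULTS = [
    Result("model-a", 0.10, 0.50),
    Result("model-b", 0.40, 0.70),
    Result("model-c", 0.50, 0.65),  # costs more than model-b and solves fewer tasks
    Result("model-d", 1.20, 0.85),
]

print([r.model for r in pareto_frontier(RESULTS)])
```

With these placeholder numbers, three models remain on the frontier: the cheapest, the most accurate, and one in between. Which of them is "best" depends on how much a team is willing to pay per additional solved task.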

Public benchmarks saturate quickly and inevitably leak into training data, with none quite resembling the work our engineers do every day. Building our own benchmark has allowed us to evaluate models within our own financial software ecosystem.
We compared models side by side and unearthed their behavioral differences. Head-to-head breakdowns are available here: http://labs.ramp.com/swebench
@eglyman http://programbench.com http://codeclash.ai http://algotune.io http://swefficiency.com http://critpt.com http://videogamebench.com i think we have a few unsaturated public benchmarks let me know if you need more
Ramp has a cool post about their new internal SWE-bench and they use mini-swe-agent as the harness powering the eval. DeepSWE also used mini-swe-agent as the harness, and showed that it performs just as well as Claude Code and Codex on the tasks.

@RampLabs Great work! And if anyone would like to make their own personal SWE-Bench like this, based on real merged PRs into their codebase, we got you!

@RampLabs a triple platinum banger!
Read about mini-swe-agent in DeepSWE here: https://deepswe.datacurve.ai/blog#evaluation-harness
w/ @KLieret @jyangballin

@RampLabs we ported mini-swe-agent in opencode ui. might be relevant!

@RampLabs ahahahahahaha Ramp the AI Lab fuck yea

@RampLabs where's goat, Opus 4.6?

@RampLabs I’m personally a huge fan of this!

@RampLabs The rate at which benchmarks are saturated is accelerating across the board. Fable debuting at ~85% on day 1 is crazy!

@OfirPress Any tooling recs for companies to build their own internal SWE-benches (tasks/dataset), beyond that harness?

@eglyman eric let me get more allocation pleaseeeeeeeeeeeeeee Ramp gonna go to 1 trillion at this point with AI lab

@RampLabs ramp is becoming another frontier lab eh

@RampLabs Ramp pivots to RL env company ‼️

@RampLabs @karimatiyeh this is great

@RampLabs Benchmarks are the new moat

@RampLabs That’s cool af