
Cognition AI launches a $10 million productivity guarantee to refund enterprise customers if its Devin AI agent fails to deliver value

Productivity is calculated in engineering hours instead of tokens.

Original post
Scott Wu (@ScottWu46)

Measuring someone's productivity by their token usage is a horrible idea. Giving everyone the same fixed token budget isn't much better. So what's the right way to roll out AI across your org?

We built a system to measure how many productive engineering hours every Devin task is worth, validated against a dataset of real engineers' time estimates. The goal is to answer the fundamental question that companies are grappling with: how much real value are you getting from each of your agent sessions?

On top of that, we're giving an AI productivity guarantee! Now if Devin delivers less engineering value than you're paying for, we fund your usage until it does.

The whole industry needs to move from measuring activity to measuring output. We hope to see more AI companies taking this approach.

11:23 AM · Jun 4, 2026 · 26.4K Views
Posts from X
Views 2.6K · Likes 37 · Retweets 6
Walden (@walden_yan)

In a world where teams are burning through token budgets without clear ROI, we've developed scalable ways to measure the value of agents' work. And now we're offering customers up to $10M in guaranteed output with Devin.

1h · Views 2.6K · Likes 37 · Bookmarks 6
Bookmarks 8 · Replies 15
swyx (@swyx)

Finally! the first eval ship from cog!!!!!!!!!! 👼🏼

To contextualize: @METR_Evals cap out at ~16 hours.

Cog has private enterprise evals up to 100hrs, and is confident enough to put a financial guarantee on it 🤯

METR dataset: ML eng, GPU kernels, cybersecurity

> "METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth". r_log of 0.83

Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations

> "We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers." r_log of 0.74 on held-out set

this is pioneering real world evals work and part 1 of a broader frontier code evals drop that I'm really looking forward to writing up. huge kudos to @annarmitchell and @ryanbai1412 for leading the unglamorous last mile data collection!!

57m · Views 2.5K · Likes 33 · Bookmarks 8
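
For readers unfamiliar with the correlation figures quoted in the post above (0.83 and 0.74), here is a minimal sketch of how such a number could be computed. It assumes the metric is a Pearson correlation between the logarithm of the estimated hours and the logarithm of the human ground-truth hours; neither post defines it, so that reading is an assumption. The function name `log_pearson_r` and the sample numbers are hypothetical and are not taken from either dataset.

```python
import math


def log_pearson_r(estimated_hours, reference_hours):
    """Pearson correlation between log-transformed hour estimates.

    Assumption: this is one plausible reading of the correlation
    quoted in the post, not a confirmed definition.
    """
    if len(estimated_hours) != len(reference_hours):
        raise ValueError("inputs must have the same length")
    if len(estimated_hours) < 2:
        raise ValueError("need at least two sessions")

    # Work in log space so a 1h-vs-2h miss counts like a 10h-vs-20h miss.
    xs = [math.log(h) for h in estimated_hours]
    ys = [math.log(h) for h in reference_hours]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")

    return cov / math.sqrt(var_x * var_y)


# Made-up numbers for illustration only; not from either dataset.
model_estimates = [0.5, 2.0, 8.0, 40.0]
human_estimates = [1.0, 4.0, 16.0, 80.0]
print(log_pearson_r(model_estimates, human_estimates))
```

In this toy input every model estimate is exactly half the human estimate, and the correlation still comes out at essentially 1.0. A correlation in log space ignores a constant multiplicative offset, so a high value on its own shows that the ranking and relative scale of estimates track the ground truth, not that the estimates are unbiased.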