AI · 1h ago

Cognition AI launches a $10 million productivity guarantee to refund enterprise customers if its Devin AI agent fails to deliver value

Productivity is calculated in engineering hours instead of tokens.

Quote posts

#214

Comments

#706

Original post

Scott Wu @ScottWu46 · #720 in AI

Measuring someone's productivity by their token usage is a horrible idea. Giving everyone the same fixed token budget isn't much better. So what's the right way to roll out AI across your org?

We built a system to measure how many productive engineering hours every Devin task is worth, validated against a dataset of real engineers' time estimates. The goal is to answer the fundamental question that companies are grappling with: how much real value are you getting from each of your agent sessions?

On top of that, we're giving an AI productivity guarantee! Now if Devin delivers less engineering value than you're paying for, we fund your usage until it does.

The whole industry needs to move from measuring activity to measuring output. We hope to see more AI companies taking this approach.

11:23 AM · Jun 4, 2026 · 26.4K Views
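The post does not spell out how the guarantee is computed. As a rough illustration only, the comparison between "engineering value delivered" and "what you're paying for" might be sketched as below. The function name, the flat hourly rate, and the sample numbers are all hypothetical and are not Cognition's actual terms.

```python
def guarantee_shortfall(spend_usd, estimated_hours, hourly_rate_usd):
    """Return the dollar gap between spend and estimated value, or 0.0 if none.

    Assumption (not from the source): value is the sum of per-session
    engineering-hour estimates multiplied by a single flat hourly rate.
    """
    value_usd = sum(estimated_hours) * hourly_rate_usd
    return max(0.0, spend_usd - value_usd)


# Hypothetical per-session engineering-hour estimates for one customer.
session_hours = [3.5, 12.0, 0.75, 6.0]

shortfall = guarantee_shortfall(
    spend_usd=5000.0,
    estimated_hours=session_hours,
    hourly_rate_usd=100.0,
)
print(shortfall)  # 2775.0
```

Under this reading, a positive shortfall is the amount of usage the vendor would keep funding until delivered value catches up with spend.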

Sentiment

Positive users praise Cognition's $10M productivity guarantee for Devin as a novel focus on real output and AI utility, while negative users suspect it is rigged or incentivizes misuse.

Positive: 63.0%

Negative: 37.0%

20 comments with sentiment.

Posts from X

Most Activity

VIEWS 2.6K · LIKES 37 · RETWEETS 6

Walden@walden_yan

In a world where teams are burning through token budgets without clear ROI, we've developed scalable ways to measure the value of agents' work. And now we're offering customers up to $10M in guaranteed output with Devin.

1h · 2.6K views · 37 likes · 6 retweets

BOOKMARKS 8 · REPLIES 15

swyx@swyx

Finally! the first eval ship from cog!!!!!!!!!! 👼🏼

To contextualize: @METR_Evals cap out at ~16 hours.

Cog has private enterprise evals up to 100hrs, and is confident enough to put a financial guarantee on it 🤯

METR dataset: ML eng, GPU kernels, cybersecurity

> "METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth". rlog of 0.83

Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations

> "We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers." rlog of 0.74 on held out set

this is pioneering real world evals work and part 1 of a broader frontier code evals drop that I'm really looking forward to writing up. huge kudos to @annarmitchell and @ryanbai1412 for leading the unglamorous last mile data collection!!
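The "rlog" figures quoted in the post above (0.83 for METR, 0.74 for Cognition's held-out set) are not defined in the post. One plausible reading, assumed here, is a Pearson correlation computed on log-transformed hours, which keeps a few very long sessions from dominating the score. A minimal sketch of that reading, with made-up sample data:

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_correlation(predicted_hours, human_hours):
    """Pearson r on log-hours (assumed meaning of "rlog"); inputs must be > 0."""
    log_pred = [math.log(p) for p in predicted_hours]
    log_true = [math.log(h) for h in human_hours]
    return pearson(log_pred, log_true)


# Hypothetical data: model estimates vs. user-reported hours per session.
predicted = [0.5, 2.0, 8.0, 40.0, 90.0]
reported = [1.0, 1.5, 12.0, 30.0, 100.0]
print(round(log_correlation(predicted, reported), 2))
```

A score near 1.0 would mean the estimates track the reported times closely on a multiplicative scale, so being off by a factor of two costs the same at 1 hour as at 50 hours.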

57m2.5K338