
Boris Cherny, Claude Code creator at Anthropic, shares five tips for running Claude Opus autonomously on multi-day software tasks

Key recommendations include enabling auto mode and cloud-based execution



Original post

Boris Cherny @bcherny

Seeing a number of benchmarks showing Opus is the best model for long-running work.

Five tips for running Opus autonomously for hours/days:

1. Use auto mode for permissions, so Claude doesn’t ask for approval
2. Use dynamic workflows, to have Claude orchestrate hundreds/thousands of agents to get a task done
3. Use /goal or /loop, to nudge Claude to keep going until it’s done
4. Use Claude Code in the cloud, so you can close your laptop (easiest way is the desktop or mobile app)
5. Make sure Claude has a way to self-verify its work end to end: Claude in Chrome browser extension for web, iOS/Android sim MCP for mobile, a way to start the full web server or service for backend work
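Tips 3 and 5 together describe one control pattern: keep the agent working until an end-to-end check passes. The Python below is a minimal, hypothetical sketch of that pattern only. `agent_step`, `verify`, and `max_iterations` are illustrative names, not part of Claude Code, and this is not a description of how /goal or /loop are actually implemented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    done: bool       # True if verification passed within the budget
    iterations: int  # number of agent steps actually taken


def run_until_verified(
    agent_step: Callable[[int], None],  # hypothetical: one unit of agent work
    verify: Callable[[], bool],         # hypothetical: end-to-end check (tests, browser run, etc.)
    max_iterations: int = 100,          # hypothetical budget cap
) -> LoopResult:
    """Keep nudging the agent until the end-to-end check passes or the budget runs out."""
    for i in range(1, max_iterations + 1):
        agent_step(i)
        if verify():
            return LoopResult(done=True, iterations=i)
    return LoopResult(done=False, iterations=max_iterations)


# Toy usage: each step "fixes" one of three bugs; verification passes when none remain.
state = {"bugs_left": 3}


def fake_step(i: int) -> None:
    state["bugs_left"] = max(0, state["bugs_left"] - 1)


def fake_verify() -> bool:
    return state["bugs_left"] == 0


result = run_until_verified(fake_step, fake_verify, max_iterations=10)
print(result)  # LoopResult(done=True, iterations=3)
```

The budget cap keeps a check that never passes from looping forever. The loop's stopping condition is only as trustworthy as `verify`, which is the thread's broader point about self-verification.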

Rishi Desai @rishi_desai2

Can coding agents stay coherent over a 1 billion token budget?

Can they build Slack from scratch? Rewrite a JAX codebase in PyTorch? Build a C compiler in Rust?

Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.

6:16 PM · Jun 7, 2026 · 467.3K Views

Sentiment

Positive comments praise Claude Opus’s self-verification and workflow features for long-horizon coding tasks, while negative comments worry about undetected hallucinations and verification failures in multi-day agent loops.

Positive: 43.6%

Negative: 56.4%

67 comments with sentiment.

Cluster Engagement

Posts from X

Tomer Stern @tomer_stern

@bcherny That’s all very fair and reasonable. I’m just saying, from the outside, it looks like a lot of awesome tools are being built and marketed on here towards Claude Max users. People on enterprise accounts with $100-$500 of spend a month don’t seem to be considered.

Boris Cherny @bcherny

@remondimi Most important thing I’ve found is self-verification + dynamic workflows prompted with something like “use a workflow to test the result e2e in a browser using claude in chrome mcp. Especially look for edge cases and ui issues”

elvis @omarsar0

Great tips.

In practice, this is how it roughly looks to run agents autonomously for hours or days.

/goal or /loop to keep it going.

Verification is crucial here.

Boris Cherny @bcherny

Seeing a number of benchmarks showing Opus is the best model for long-running work.

Five tips for running Opus autonomously for hours/days:

Boris Cherny @bcherny

A few things I’ve used very long running sessions for:

- Building complex features
- Migrating code from language X to Y
- Migrating code from framework X to Y
- Repeatedly profiling and optimizing code to hit a specific memory or CPU target
- Finding and fixing flaky tests in CI
- Profiling CI to make it faster

Ray Amjad @theramjad

@bcherny It's pretty good! I just had a goal running for 19 hours which verified almost 300 user flows with Chrome.

Boris Cherny @bcherny

@tomer_stern I think of it in terms of ROI rather than absolute cost: how much would it have cost to do the same work manually? Often the answer is weeks or even months of engineering time

Boris Cherny @bcherny

@jmorin35 Context rot isn’t a thing with 4.8 imo, but curious if that’s been your experience also

Boris Cherny @bcherny

@AyubiJS These are not designed for people to invoke them, though you can do so if you want. Just tell the model what you want to happen, and it will do the work to invoke the right skills for you

Tomer Stern @tomer_stern

@bcherny What I’m saying is the median data scientist or software engineer is going to have a boss that’s happy to pay for $100-$500 of spend a month.

Uber giving employees $1500 is going to be atypical.

Boris Cherny @bcherny

@Tuzoff Yes. It’s more powerful and more token-efficient

Tomer Stern @tomer_stern

@bcherny How expensive are these kinds of things for people on enterprise accounts? Is it realistic to be using Opus to that extent with anything less than a $5,000 budget?

Or is the idea that this kind of thing is only being advertised to people on personal Max accounts?

Boris Cherny @bcherny

@adrienv1520 Run /usage to see a breakdown of the specific skills, mcps, and plugins that are using your tokens

Alok Bishoyi @alokbishoyi97

@bcherny We have had the same experience as well, from all the 10k+ autoresearch runs that we have enabled via evo. Opus + CC, especially with the new workflows, has proven to be the best substrate

thanks for all your work

http://github.com/evo-hq/evo

Tomer Stern @tomer_stern

@bcherny And even that Uber data scientist has an order of magnitude lower budget than one gets with a $200-a-month Claude Max plan.

Idk I’m sure you all are talking about this. Microsoft launching their own sub-frontier agentic coding model this month is probably not a coincidence

Justin Morin @jmorin35

Thank you Boris! My concern is how you deal with compaction and/or the session getting too long, when context rot sets in at 600-700k+ tokens. I get worried when I’m away from my computer and can’t see the context in CC, because the cloud version doesn’t show the context number or allow manual compaction.

Boris Cherny @bcherny

@tomer_stern Enterprise seat limits are configurable, maybe ask your admin to increase limits?

Mike Remondi @remondimi

@bcherny This is all amazing if you can define your acceptance criteria. If you're in a metric driven domain like ML, you're in a great place. This is still REALLY hard for product engineering though. Still feels like taking your hands off the wheel too much

Boris Cherny @bcherny

@ameedjamous Just tell Claude to use a workflow

tonbi @tonbistudio

@bcherny If anyone is confused and wants to learn more about loops, I did my best to synthesize the available information and try to break them down with real examples in this X article:

Adrien Valcke @adrienv1520

@bcherny That’s all fantastic, except once you run it you hit the usage limit in 2.5 hours, even on the $200 Max plan. Don’t get why/how usage has been climbing so fast recently.
