Have been extensively testing Claude Workflows this weekend, with the best model possible. Threw it at my whole code base, combing for bugs. 144 found and fixed! Geez... It is a large code base, for sure, but 144?!! Some are very impactful, some are downright embarrassing...
Shopify CTO Mikhail Parakhin says a weekend test of Claude Workflows on his codebase found and fixed 144 bugs
The issues ranged from highly impactful to embarrassing oversights
Some users praise Claude for honestly finding many bugs in a large codebase with fresh eyes, while others advise verifying its audit results carefully.

@MParakhin What's the highest number of agents you've seen it spin up? I've seen 54 to 133.

@stombya I always run a critique loop between several models. The false positive count was 1 out of 174 (found 30 more since I posted). Without the critique loop it is about 20 out of 174.
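
For anyone who wants those counts as rates, here is a quick sketch of the arithmetic. The function name and constant are just illustrative, and the "without critique loop" figure is the approximate "about 20" from the reply, not a measured value.

```python
def false_positive_rate(false_positives: int, total_reported: int) -> float:
    """Share of reported bugs that turned out not to be real bugs."""
    if total_reported <= 0:
        raise ValueError("total_reported must be positive")
    return false_positives / total_reported


# 144 bugs in the original post plus the 30 found since, per the reply.
TOTAL_REPORTED = 144 + 30

with_critique = false_positive_rate(1, TOTAL_REPORTED)
# "About 20" in the reply, so treat this rate as approximate.
without_critique = false_positive_rate(20, TOTAL_REPORTED)

print(f"with critique loop:    {with_critique:.1%}")
print(f"without critique loop: {without_critique:.1%}")
```

So the reported numbers work out to roughly 0.6% false positives with the critique loop versus roughly 11.5% without it.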

@MParakhin Did similar, working through them now. (had 30 total)

@MParakhin I do not recommend letting Claude look for bugs. Or rather, not acting on anything unless Codex signs off on it.

@MParakhin What false positive rate are you getting?

@iMuffined Still need to run a critique loop

@AnonymousQuant I do run a critique loop on everything, of course.

@MParakhin mythos?

@andrewmccalip About the same

@MParakhin this is the part that changes code review. agents can find a huge pile of issues, but the engineering work becomes deciding which fixes are worth merging.

@MParakhin I for one am thrilled to be purchasing tokens at a rate of 100+ agents at once. It was a bit burdensome to have to micromanage them one at a time.

@MParakhin 144 is wild, but the number I always want is the accept ratio — how many did you actually review before merging? My bug-combing runs surface real issues mixed with confidently-wrong rewrites, and the reviewing is where the cost shows up.

@MParakhin lol this is both impressive and terrifying
how long did it take to run through everything?

@MParakhin best model possible🤔

@Yingzhe0301 ;-)

@iMuffined @MParakhin seems like it...

@MParakhin You may want to check out Barnum if you want less magic and more determinism

@MParakhin Made Claude run through the entire Shopify?

@omerhuseyin About 4 hours