How much better do the models have to get before you'll stop reading the code?
T3 Stack creator Theo Browne asks how capable AI must get before developers stop reviewing its code
Story Overview
T3 Stack creator Theo Browne is asking at what point AI code generation will have earned enough developer confidence that human review becomes optional. The question has fueled fresh discussion of whether today's models are anywhere close to that bar.
Trust numbers show persistent skepticism
Recent surveys put developer trust in AI output accuracy at just 29 percent, with 46 percent actively distrusting the results, and AI-assisted pull requests merging at roughly half the rate of human ones.
Benchmarks leave real-world gaps unclosed
Top models hit around 67 percent pass@1 on HumanEval and higher on some verified suites. Yet logic errors, security vulnerabilities in nearly half of generated code, and lower scores on harder tests leave the capability threshold for skipping reviews undefined.
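For readers unfamiliar with the metric: pass@1 is the share of benchmark problems a model solves with a single sampled attempt. The sketch below uses the unbiased pass@k estimator published alongside HumanEval, which reduces to c/n when k = 1. The function names and the sample counts are illustrative assumptions, not figures from the surveys or benchmarks mentioned above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: attempt budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: three problems, three samples each (made-up counts).
hypothetical_results = [(3, 2), (3, 0), (3, 3)]
score = benchmark_pass_at_k(hypothetical_results, k=1)
print(f"pass@1 = {score:.3f}")
```

A score like this only says how often generated code passes a benchmark's own tests. It says nothing about security or behavior outside those tests, which is the gap the replies below keep returning to.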
Some users are excited that AI models could make code review obsolete, arguing that cheap code generation lets engineers skip reading most of it. Others despise the idea of LLMs writing new code without extremely explicit guidelines.
Most Activity
At this point I’m genuinely convinced most of you would have kept reading the assembly code after C got popular
I'll be honest, I barely even read the code back when I wrote it by hand...
I’m gonna do a video on the “you should still read your code” thing and it’s going to piss both sides off. I’m excited :)
@theo You read the code?
@theo Moving from assembly language to compilers, there was a 24 month window where it mattered.

@zeeg Bold coming from someone whose code is gpt-3.5 level
i will stop reading the code when git blame starts blaming the model instead of me

@theo two orders of magnitude with actual real verification capabilities

@theo the problem to solve here is the verification not the code

@theo @WallisDev you have little to lose
i - along with every other major business in the world - have a lot to lose
all it takes is a shitty data migration, a simple bypass to slip through and people face immense liability

If only there was a product to make it easier to identify bugs and fix them...
Jokes aside, there are obviously differences across types and scales of software. I just know there's a lot of devs still reading code on side projects as if it matters. I'd go as far as saying that the majority of code at most companies is not as important as the company pretends it is (e.g. company blog, documentation sites, SDKs that are just API wrappers, throwaway internal tools, API scaffolding, etc.)

@theo got rejected in a recent interview for telling them it's pointless to read code at this point

@theo about tree fiddy

@zeeg @WallisDev I spend a lot of time conversing with the model and getting a spec that we’re both aligned on. Once I’m confident in the surface and the model’s understanding, it’s genuinely hard to care about the details for me

let alone that a few sentences will never appropriately describe the thing you're trying to build - nor will generating a spec from those same few sentences. you need a massive speed increase on top of a massive precision/capability increase
(+ a ton of supporting software that is scalable and cheap, which doesn't exist today, to verify)

@theo You still read code ?
This was the topic of my talk at @aiDotEngineer - code got cheap, attention didn't!

@glcst I wrote this before seeing your reply lol

@theo My current personal project is my benchmark, and I still feel the need to review the code. C is a tricky bugger for code that works and feels good to use as a library

I agree w him
I’d need some kind of test suite to give me the confidence to put it in actual production software
The question is too broad. Different kinds of projects require different levels of scrutiny (e.g. file system, database or core data structure? I hope you know how it fails)