Do you have your coding agents include automated tests for the code that they write?
Many users endorse requiring automated tests for AI coding agents because they prevent errors and save trouble, while others object that agents often produce fake tests or create extra QA burdens.
(I'm firmly on team red/green TDD for agent code. I like having a test suite that protects against agents breaking old features when they make new changes - https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/)

@simonw My agents don't write any code unless there are tests first!

@tharshan_09 I don't particularly care - too many tests is better than too few tests now that maintaining and deleting tests in the future is so much cheaper

@simonw in projects I work on I also describe human-readable journeys in an md file, and then I ask the agent to perform those steps and identify problems

@simonw I do, but how do you prevent it from writing dumb tests and just bloating the test count?

@simonw Almost always, but on a large team, I no longer trust in-code tests as a unit of defense because the agents can change them just as easily. And with all the code that's being written, reviewing anything, especially test files, is hard.

@simonw I was exploring how to inject requirements drift into the agent's context after the tests run. That way the agent goes back into the loop knowing what’s wrong or misaligned with the instructions/requirements. What do you guys think about this approach?

@simonw You’re signing yourself up to be end to end QA without them

@simonw Almost always, but not for code quality. For agent self-verification. Tests are the checkpoint system. Without them the agent has no way to know if step 7 broke what step 2 built.

@simonw yes, but it's pretty useless: it tests the implementation it already wrote, good and bad, not the actual expectation that would challenge it

@simonw I always add critic layers with quality screening, saved me a lot of time so I don’t need to confirm every time.
For other agentic purposes, specifically small models, the harness always comes with an observer agent that classifies data and inspects the main agent in a feedback loop

@simonw Default should be: tests when the agent changes behavior, smoke checks when it wires plumbing, and explicit "no test added because..." when it does neither. The best agents do not just write tests; they expose what claim the test is supposed to prove.
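
The default described in the reply above could be expressed as a small check. This is only a rough sketch of one reading of that policy: `ChangeKind`, `check_change`, and all of the parameter names are hypothetical and do not come from any real agent harness.

```python
from enum import Enum
from typing import Optional


class ChangeKind(Enum):
    BEHAVIOR = "behavior"  # the agent changed what the code does
    PLUMBING = "plumbing"  # the agent only wired components together
    NEITHER = "neither"    # e.g. comments, renames, formatting


def check_change(
    kind: ChangeKind,
    tests_added: bool = False,
    smoke_check_added: bool = False,
    no_test_reason: Optional[str] = None,
) -> Optional[str]:
    """Return None if the change meets the policy, else a complaint string."""
    if kind is ChangeKind.BEHAVIOR:
        return None if tests_added else "behavior change needs a test"
    if kind is ChangeKind.PLUMBING:
        if smoke_check_added or tests_added:
            return None
        return "plumbing change needs a smoke check"
    # Neither behavior nor plumbing: require an explicit justification
    # instead of silently adding nothing.
    if no_test_reason and no_test_reason.strip():
        return None
    return 'missing explicit "no test added because..." note'
```

A reviewer or CI step could call `check_change(ChangeKind.NEITHER, no_test_reason="docs only")` and reject the change whenever the result is not `None`.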

@simonw Honestly yes, but agents write tests that confirm their own logic rather than challenge it. You're testing the model's assumptions, not the actual requirements.

@simonw Especially important for bug fixing. If it can’t reproduce it, I’m probably going in by hand.
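
For illustration, reproducing a bug before fixing it might look like the sketch below. The price parser, its bug, and the input string are invented for this example; the point is only that the same reproduction check fails against the old code (red) and passes against the fix (green).

```python
def parse_price_buggy(text: str) -> float:
    # Hypothetical pre-fix version: raises ValueError on thousands separators.
    return float(text.replace("$", ""))


def parse_price_fixed(text: str) -> float:
    # Hypothetical post-fix version: strips "$" and "," before parsing.
    return float(text.replace("$", "").replace(",", ""))


def reproduction_passes(parse) -> bool:
    """Reproduction test: True only if the reported input parses correctly."""
    try:
        return parse("$1,200.50") == 1200.50
    except ValueError:
        return False


# Red: the reproduction fails against the buggy code, so the bug is confirmed.
assert not reproduction_passes(parse_price_buggy)
# Green: the same check passes once the fix is in place.
assert reproduction_passes(parse_price_fixed)
```

If the agent cannot get the first assertion to hold, the bug has not actually been reproduced, which is the signal the reply above uses to step in by hand.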

@simonw Since using Codex I responsibly wait for it to set up automated tests despite my wishes.
I hate waiting, but I don't feel I'm in a position to argue with it.

@simonw I have functional and integration tests that have to be planned upfront, as agents need special permissions to edit those directories. Unit tests are free to modify, but almost never catch any regression. I wonder what you guys do to have properly tested software.
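
One way to enforce that kind of split, sketched under assumptions: a check that lists which changed files fall inside protected test directories, so a pipeline can demand extra approval for them. The directory names and the `protected_changes` helper are hypothetical, and how the list of changed paths is obtained (for example from a diff) is left out.

```python
from pathlib import PurePosixPath

# Hypothetical layout: these directories require special permission to edit.
PROTECTED_DIRS = ("tests/functional", "tests/integration")


def protected_changes(changed_paths, protected_dirs=PROTECTED_DIRS):
    """Return the changed paths that live under a protected directory."""
    protected = [PurePosixPath(d) for d in protected_dirs]
    hits = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # path.parents holds every ancestor directory of the file.
        if any(d in path.parents for d in protected):
            hits.append(raw)
    return hits
```

Unit test files such as `tests/unit/test_x.py` pass through untouched, while anything under the protected directories is returned for review.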

@simonw If your coding agents aren’t writing their own tests, you haven’t built a developer tool…you’ve just built a high-speed technical debt generator. Code without an automated test suite isn't shipping; it's just liability.

@simonw I’m more surprised if anyone doesn’t do this
yes agents can reward hack the tests to pass, but writing them first *mostly* mitigates this

@simonw i'd like to add a nuance though. their tests are checked and validated by different models, e.g. claude/chatgpt/grok/gemini. the planner writes the test plan before the worker agent starts.

right instinct. the failure mode i keep seeing is the one tdd can't catch: agent passes every test, output still wrong because the context it pulled was internally inconsistent. two documents contradicting each other, model treats both as ground truth. you can test the code. testing the context is the harder problem