Finally caved in, and I now fully speak to agents as opposed to typing prompts.
My first realization is that you can just blabber on and tell the agent so many rich details via audio. The longer and the more detailed the audio explanation, the better the results.
The most interesting thing about interacting with the agent this way is that I can parallelize more work and enable agents to perform way longer runs, implementing many things at once.
In addition, I have developed a new feature where I can record the screen, take screenshots, track mouse actions and movements, annotate, and explain (using voice) to the agent things that it struggles with, like design and precise feature development.
My finding is that the richer the prompt modality, the more reliable the agent results are. The noise (if any) doesn't even matter. Yes, it's more expensive (i.e., lots more tokens used this way), but the reliability that you are getting is worth it.
I often store those as reusable commands/skills where it applies and inject them into loops.
The results are night and day.











