Voice And Screen Inputs Boost AI Agent Reliability And Task Scale

Original post

elvis@omarsar0 · #684 in Tech

Finally caved in, and I now fully speak to agents as opposed to typing prompts.

My first realization is that you can just blabber on and tell the agent so many rich details via audio. The longer and the more detailed the audio explanation, the better the results.

The most interesting thing about interacting with the agent this way is that I can parallelize more work and enable agents to perform way longer runs, implementing many things at once.

In addition, I have developed a new feature where I can record the screen, take screenshots, track mouse actions and movements, annotate, and explain (using voice) to the agent things that it struggles with, like design and precise feature development.

My finding is that the richer the prompt modality, the more reliable the agent results are. The noise (if any) doesn't even matter. Yes, it's more expensive (i.e., lots more tokens used this way), but the reliability that you are getting is worth it.

I often store those as reusable commands/skills where it applies and inject them into loops.

The results are night and day.

9:17 AM · Jun 24, 2026 · 8.5K Views
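
The post mentions storing prompts as reusable commands/skills and injecting them into loops but does not show how. A minimal sketch of one way this could look follows. `SkillStore`, `run_loop`, and `run_agent` are hypothetical names used only for illustration and are not the author's tooling; `run_agent` stands in for whatever agent call is actually used.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SkillStore:
    """Named, reusable instructions, e.g. cleaned-up voice transcripts (hypothetical)."""
    skills: dict[str, str] = field(default_factory=dict)

    def save(self, name: str, transcript: str) -> None:
        # Store the transcript under a name so it can be reused later.
        self.skills[name] = transcript.strip()

    def build_prompt(self, task: str, skill_names: list[str]) -> str:
        # Prepend each requested skill so the agent sees it before the task.
        parts = [f"[skill: {n}]\n{self.skills[n]}" for n in skill_names]
        parts.append(f"[task]\n{task}")
        return "\n\n".join(parts)


def run_loop(
    tasks: list[str],
    store: SkillStore,
    skill_names: list[str],
    run_agent: Callable[[str], str],
) -> list[str]:
    """Run every task with the same skills injected; run_agent is a stand-in."""
    return [run_agent(store.build_prompt(task, skill_names)) for task in tasks]
```

Keeping skills as plain named strings means a long spoken explanation only has to be recorded once and can then be reused across many runs.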

Sentiment

Positive users praise voice and screen inputs for AI agents because they enable richer context sharing, reduce typing bottlenecks, and improve reliability through natural modalities.

Positive: 100.0% · Negative: 0.0% · 11 comments with sentiment.

Posts from X

elvis@omarsar0

If there is interest, I am happy to do a live session showing how I use the whole voice + annotation functionality. And how I reuse them as skills. I think it's fascinating. Just let me know.

Austin Born@austinbuilds

@omarsar0 What tools do you use to speak to your agents? I've found that iOS's built-in speech-to-text feature doesn't hear my voice well, and other tools like Whisperflow still miss my intent too frequently.

Jerry the Martian@jerry543

@omarsar0 Curious how you handle technical terms, though. Variable names and API calls over voice have always tripped me up.

elvis@omarsar0

That's interesting. I guess as long as things are in context, it's okay. But I have started to notice some hiccups in the initial part of the conversation or when I completely switch the task in an agent session. I guess a dictionary would be good to keep, but I haven't had too many issues with it.
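
The dictionary idea in the reply above could be sketched as a post-processing step on the transcript before it reaches the agent. This is only an illustration under stated assumptions: `VOCAB` and `apply_vocab` are hypothetical names, the entries are made-up examples, and nothing here reflects a specific speech-to-text product.

```python
import re

# Hypothetical project vocabulary: spoken form -> written form used in the code.
VOCAB = {
    "get user by id": "get_user_by_id",
    "fast api": "FastAPI",
    "jason": "JSON",
}


def apply_vocab(transcript: str, vocab: dict[str, str]) -> str:
    """Replace spoken forms with written identifiers (case-insensitive, whole words)."""
    # Longest phrases first so multi-word entries win over shorter overlapping ones.
    for spoken in sorted(vocab, key=len, reverse=True):
        pattern = r"\b" + re.escape(spoken) + r"\b"
        written = vocab[spoken]
        # A function replacement avoids backslash interpretation in the written form.
        transcript = re.sub(pattern, lambda _m, w=written: w, transcript, flags=re.IGNORECASE)
    return transcript
```

For example, `apply_vocab("Call Get User By ID and return jason from the fast api route", VOCAB)` yields `"Call get_user_by_id and return JSON from the FastAPI route"`.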

elvis@omarsar0

@austinbuilds Probably overkill, but ElevenLabs. I find ElevenLabs deals with my accent and stuff really well :)

NoiselessTrading@neha041187

Love this shift to voice-first — the ‘just blabber on’ approach is such a game changer. Richer modalities really do unlock way better agent behavior. The screen recording + mouse tracking + voice annotations combo sounds powerful. Have you tried feeding those multimodal sessions back into the agent as self-reflection loops (e.g., “review what you just built from this recording”)? Night and day results make total sense. The token cost is the tax for reliability. Keep shipping 🔥

Hunter Gon@gonlenidefi

@omarsar0 the art of verbal vomit finally gets rewarded instead of punished

i respect the self awareness on the rambling meta

Eclipse 🌖@ECLresearch

@omarsar0 This is the underrated UX unlock of the agent era—voice allows for parallel context injection that typing bottlenecks.

Daniel MacLean@macleanestrada

@omarsar0 install headroom, and dictate a book if you wish. It really works

Jitin Kapila@jitinkapila

@omarsar0 I have been doing that with a small local STT tool or Win + H on Windows. Speaking to the system beats typing out each small snippet.

V0LYX@0xV0LYX

@omarsar0 Yeah, the bandwidth difference is wild. You say stuff out loud your brain didn't even know it was going to say.

Utkarsh Singh@Utkarsh51557661

@omarsar0 I've tried that—way better than typing. Less chance to overthink and you get to share the whole context.

Sven Nachtzeit@SvenUrbanSci

@omarsar0 Design specification is hard to express textually to agents. Your visual annotation approach addresses the grounding problem well. Have you compared reliability gains against structured design tokens or component references?

Jay Kurtz@JayKurtz90

@omarsar0 Yeah, I use a similar workflow and would love to see how you do this.