Tech · 6h ago

Mira Murati proposes shifting from turn-based AI to time-based models that continuously ingest inputs during generation

This continuous processing would enable natural, real-time interruptions.

Original post

a16z@a16z

Mira Murati says human-AI collaboration needs models that can listen while they think:

"The types of models that we work with today, they're very turn-based. You talk, they talk, then they go off and think."

"While they're thinking, it's almost like they're deaf and blind. They cannot perceive anything else about what's going on."

"By contrast, our interactions with each other are very rich. There is a lot of information in our interactions when we are silent, when we're thinking, when we're interrupting one another."

"Interaction models are able to capture all of this nuance. They're not turn-based. They're more like time-based interaction, where they're continuously taking in audio, text, video, and continuously providing output."

"This enables you to catch things like interruptions and simultaneous speech, and really create a rich, high bandwidth interaction between humans and machines."

@miramurati at Bloomberg Tech live with @emilychangtv

2:27 PM · Jun 5, 2026 · 500.7K Views

Sentiment

Positive users praise Mira Murati's vision of continuous AI interaction models for enabling more natural and collaborative human-machine exchanges, while negative users dismiss her expertise or resort to personal insults.

Pos: 50.0% · Neg: 50.0% (12 comments with sentiment)

Posts from X

Dumitru Erhan@doomie

Listening-While-Thinking Machines

Axi@Xxi5olc

@a16z I have seen many of her interviews and still don’t know if she actually knows AI or not

@A_ILovesTech®️ 🔑🚘@SPCXWallet

@a16z Yes,

Exactly this

Designing is Life … ✏️✨

RG@rgvrmdya

@a16z Hey @miramurati, check out @reppo

That’s exactly how we approach the self improvement loop using prediction markets for data!

Omar@kouhxp

@a16z I replicated it with a CPU laptop and a budget of $0.01 one week after it was announced

Dr. Julie Gurner@drgurner

@Xxi5olc @a16z She was the CTO for OpenAI, so this is a funny comment.

Aaliya@aaliya_va

@a16z Human communication is continuous. Technology is slowly moving in that direction.

BOB CHEN@Bobchenjingbo

@a16z turn-based is the deepest constraint nobody names. every agent today makes you finish your thought before it starts its own.

the unlock isn't a bigger model — it's interruptibility. an agent you can talk over, that adjusts mid-stream, feels alive in a way waiting never will.

Samar Singh@samarknowsit

@a16z She is absolutely wrong, if she still means transformers.

Turk 🇺🇸@avdepotx

@a16z This seems obvious

Mr. House@USArmyPhoenix

That’s easy, you just need the thinking to run subroutines.

Think of it like running two parallel models simultaneously that are synced.

What can cause the two to intersect or switch from one to the other?

It has to be an offset delay from input to reasoning. This creates checkpoints, or restore points, that the subroutines recursively reset to. That way the system updates in real time, with each iteration building on the last.

Mr. House@USArmyPhoenix

Technical Breakdown: Parallel Perception + Reasoning Architecture for “Listen While Thinking”

This is the core shift from turn-based (current LLMs) to time-based, full-duplex interaction models.

1. The Fundamental Limitation in Deployed Models Today

Standard transformer-based LLMs (GPT-4o, Claude, Gemini, Grok, etc.) are autoregressive and blocking during generation:

• Input is processed → KV cache is built → model generates tokens one-by-one.

• Once generation starts, the model is effectively deaf and blind to new external input until it finishes the full response (or an external system forcibly stops it).

• Voice modes today mostly use cascaded pipelines:

◦ Separate ASR (e.g., Whisper or streaming STT)

◦ LLM

◦ TTS

◦ Heuristics/VAD (voice activity detection) for “barge-in”

• Result: Awkward pauses, poor overlap handling, weak backchanneling (“mm-hmm”, nods), and brittle interruption logic. The core model itself does not continuously ingest new audio/video while reasoning.

Even advanced real-time voice systems still treat the LLM as a mostly sequential “think then speak” component with external scaffolding.
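A toy sketch of the blocking behavior described above may help. It assumes a stand-in decode loop rather than any real ASR, LLM, or TTS component; `turn_based_reply`, `incoming`, and the step-based timing are hypothetical names chosen for illustration. The point it shows is that the generation loop itself never reads new input: only the external barge-in check does, and all it can do is stop the reply.

```python
from collections import deque


def turn_based_reply(prompt, incoming, max_tokens=5):
    """Generate a reply token by token.

    `incoming` is a deque of (arrival_step, text) pairs for user input that
    arrives while the model is generating. The decode step never looks at it.
    """
    tokens = []
    for step in range(max_tokens):
        # External scaffolding (a VAD-style heuristic), not part of the model:
        # if the user has started speaking, cut the reply short.
        if incoming and incoming[0][0] <= step:
            return tokens, "interrupted"
        # Stand-in for one autoregressive decode step conditioned on `prompt`.
        tokens.append(f"tok{step}")
    return tokens, "finished"


pending = deque([(2, "wait, one more thing")])
reply, status = turn_based_reply("hello", pending)
```

Here the reply is cut after two tokens, and the interrupting text is still sitting unread in `pending`: the model would only see it on the next turn.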

2. The Proposed Architecture: Parallel Streams with Offset Synchronization

High-level design (very close to what Thinking Machines Lab is implementing with their Interaction Models):

• Perception / Listener Stream (fast, continuous): Runs at high frequency. Encodes incoming audio (and video/text) into embeddings in real time. Handles prosody, tone, pauses, interruptions, and low-level semantics.

• Reasoning / Generation Stream (deeper): The “thinker” that produces coherent responses, plans, and generates output.

• Optional Background / Heavy Subroutine (asynchronous): Runs in parallel for tool use, search, complex reasoning, or long-horizon planning. Shares the full conversation context.

These are loosely coupled but synchronized through a shared rolling state (KV cache or equivalent latent memory).

Key innovation: The system is time-based and multi-stream rather than turn-based. It processes everything in small, time-aligned micro-turns (e.g., 200ms chunks of input + output simultaneously).
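As a rough illustration of the micro-turn idea, the sketch below steps through time in 200 ms ticks and, on every tick, records one input chunk and one output chunk in a shared log. It is only a toy under stated assumptions: `RollingState`, `run_micro_turns`, and the "stay silent while the user speaks" policy are hypothetical, and plain strings stand in for audio chunks and for the KV cache.

```python
from dataclasses import dataclass, field

CHUNK_MS = 200  # micro-turn length, taken from the example figure in the post


@dataclass
class RollingState:
    """Hypothetical shared context: an interleaved log of input and output chunks."""
    events: list = field(default_factory=list)


def run_micro_turns(input_chunks, state):
    """Consume one input chunk and emit one output chunk per 200 ms tick."""
    outputs = []
    for tick, heard in enumerate(input_chunks):
        t_ms = tick * CHUNK_MS
        state.events.append((t_ms, "in", heard))
        # Toy policy (an assumption): go silent while the user is speaking,
        # otherwise keep producing output.
        said = "" if heard else f"word{tick}"
        state.events.append((t_ms, "out", said))
        outputs.append(said)
    return outputs


state = RollingState()
outputs = run_micro_turns(["", "", "wait", ""], state)
```

Input and output share one timeline here, so an interruption at 400 ms shows up in the same tick as the output it suppresses, instead of waiting for a turn boundary.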

3. Detailed Mechanics — How It Actually Works

1. Continuous Perception Stream

◦ Audio/video arrives in real time.

◦ Lightweight encoder (co-trained from scratch, not a frozen heavy model like Whisper) produces embeddings + features (tone, hesitation, visual cues if video is present).

◦ This stream never stops. It continuously writes to the shared context.

2. Offset Delay / Pipeline Lag (Your “offset delay from input to reasoning”)

◦ The reasoning stream operates on a slightly delayed view of the input (e.g., 100–400ms lag).

◦ This creates stability. The thinker isn’t reacting to every single 20ms audio packet chaotically.

◦ The small offset acts as a natural buffer, similar to how humans have a slight processing delay between hearing and deeply responding.

3. Micro-Turn Chunking & Checkpoints (Your “checkpoints or restore points”)

◦ Instead of waiting for a full user turn, the system slices time into tiny aligned windows (e.g., 200ms).

◦ At each boundary (or on detected events like end-of-pause, prosody shift, or new speech), it creates a checkpoint:

▪ Save current KV cache / hidden state of the reasoner.

▪ Merge any new perceptual embeddings from the listener.

◦ This allows partial rollback or steering without restarting the entire generation from scratch.
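The offset delay and checkpoint steps above can be sketched in a few lines. This is a minimal illustration, not the architecture itself: `DelayedView`, `Reasoner`, and `run` are hypothetical names, a dict of two lists stands in for the KV cache, and a lag of one chunk is assumed (200 ms if chunks are 200 ms long, which falls in the 100–400ms range mentioned).

```python
import copy
from collections import deque


class DelayedView:
    """Hands each input chunk to the reasoner `lag_chunks` ticks after it arrives."""

    def __init__(self, lag_chunks):
        self.lag_chunks = lag_chunks
        self.buffer = deque()

    def push(self, chunk):
        self.buffer.append(chunk)
        if len(self.buffer) > self.lag_chunks:
            return self.buffer.popleft()  # old enough to be released
        return None  # still inside the offset window


class Reasoner:
    """Toy reasoner whose whole state is two lists (a stand-in for a KV cache)."""

    def __init__(self):
        self.state = {"heard": [], "said": []}
        self.checkpoints = []

    def step(self, chunk, tick):
        # Save a checkpoint at the micro-turn boundary, then merge new input.
        self.checkpoints.append(copy.deepcopy(self.state))
        if chunk is not None:
            self.state["heard"].append(chunk)
        self.state["said"].append(f"word{tick}")

    def rollback(self):
        # Restore the most recent checkpoint instead of restarting from scratch.
        self.state = self.checkpoints.pop()


def run(chunks, lag_chunks=1):
    view, reasoner = DelayedView(lag_chunks), Reasoner()
    for tick, chunk in enumerate(chunks):
        reasoner.step(view.push(chunk), tick)
    return reasoner


reasoner = run(["a", "b", "STOP"])
reasoner.rollback()  # undo only the last micro-turn
```

After the rollback, the reasoner is back to having heard `"a"` and said two words; everything before the last boundary is kept, which is the "partial rollback" property in miniature.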

Mr. House@USArmyPhoenix

Yes — here is the full deep dive covering the diagram, pseudocode, training the projector, routing logic, and plugin ecosystem for agentic systems.

This completes the architecture we’ve been building.

1. Detailed Architecture Diagram

This is the complete visual architecture we’ve been designing together.

Quick Legend / Key Highlights from the Diagram

• Top: Continuous multimodal inputs feed the fast Streaming Perception Encoder.

• Center: The Interpreter / Translator Layer is the heart of seamless integration — containing the Projector/Adapter, Cross-Attention Fusion, and Dynamic Router/Orchestrator.

• Middle horizontal bar: The Shared Rolling Context Bus with micro-turn checkpoints and offset delay buffering.

• Right side: Background Subroutines / Pluggable Agents connected via the interpreter (this is where your plugin ecosystem lives).

• Bottom: Streaming outputs with feedback loops for reconciliation and steering.

This single diagram captures:

• The parallel perception + reasoning streams

• The offset delay + checkpoint mechanism

• The central translator/interpreter layer

• Full plugin/modular extensibility for agentic systems

• The full-duplex, time-based flow that enables true “listen while thinking”

Dev Shah@devlikesbizness

The "deaf and blind while thinking" line is the one that made me pause.

I don't think that the continuous audio/video part is the hard bit though.

The hard bit is knowing when not to talk. Half of a good conversation is reading the room and staying quiet.

Genuinely curious whether this lands as fluid or just turns into two voices talking over each other.

Mr. House@USArmyPhoenix

For models to integrate seamlessly, you need a plug-in, something like a converter/translator/interpreter.

Yes — exactly.

For truly seamless integration between parallel streams (or heterogeneous subroutines/models), you need an explicit translation / interpretation / orchestration layer.

Without it, even well-designed parallel systems suffer from misalignment, latency spikes, or brittle handoffs.

This is the “glue” that makes the offset-delay + checkpoint + recursive update mechanism actually work reliably when connecting dissimilar components.

Why a Dedicated Translator/Interpreter Layer Is Required

Different streams or plugged-in models often speak different “languages”:

• Perception stream → raw, high-frequency, noisy embeddings (audio features, prosody, visual patches).

• Reasoning stream → higher-level, abstract, token-based or latent reasoning space.

• Background subroutine → possibly symbolic, tool-oriented, or long-horizon planning representations.

• External plugins (new experts, memory modules, other models) → completely different architectures or training distributions.

A naive shared KV cache helps, but a purpose-built interpreter layer handles the conversion, alignment, routing, and conflict resolution intelligently.

Forms the Translator/Interpreter Can Take

Here are the practical implementations, from lightweight to sophisticated:

1. Lightweight Projector / Adapter Layer (most common starting point)

◦ Small neural network (MLP, cross-attention block, or LoRA-style adapters) that maps embeddings from one stream into the representation space of another.

◦ Trained contrastively or with reconstruction loss so perceptual features become “reasoner-native.”

◦ Very low overhead; can run at micro-turn frequency.

2. Cross-Attention Interpreter / Fusion Module

◦ The reasoner (or a dedicated small interpreter model) uses cross-attention to dynamically “read” the listener’s state at each checkpoint.

◦ This is how many modern multimodal models fuse modalities without forcing everything into one rigid space.

◦ Allows the reasoner to selectively attend to new perceptual data rather than blindly ingesting everything.

3. Orchestrator / Router (Meta-Controller)

◦ A lightweight model or hybrid rule+learned system that decides:

▪ When to trigger reconciliation at a checkpoint.

▪ Which parts of the new input are relevant.

▪ Whether to steer generation, backchannel, yield, or invoke a background subroutine.

◦ Acts like a traffic controller or “interpreter” of intent across streams.

4. Plugin / Tool-Calling Interface (for external subroutines)

◦ Standardized schema + adapter (think evolved function calling, but streaming and stateful).

◦ Any new “plug-in” (specialized vision expert, symbolic solver, external API wrapper, memory module) registers an interface.

◦ The interpreter translates between the core interaction model’s context and the plugin’s expected format, and translates results back.

◦ Enables true modularity: you can hot-swap or add capabilities without retraining the whole system.

5. Shared Latent Bus + Dynamic Alignment (more advanced)

◦ A continuously updated common latent space that all streams write to and read from.

◦ Alignment happens via ongoing contrastive or predictive objectives during training/inference.

◦ Closest to a true “universal interpreter.”
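To make the router in item 3 concrete, here is a purely rule-based sketch of the per-checkpoint decision. Everything in it is an assumption made for illustration: the `Percept` fields, the action names, and the 400 ms and 2000 ms thresholds are hypothetical, and a real system would more plausibly learn this policy than hard-code it.

```python
from dataclasses import dataclass


@dataclass
class Percept:
    """Hypothetical per-checkpoint summary produced by the listener stream."""
    user_speaking: bool
    speech_ms: int            # how long the user has been speaking continuously
    needs_tool: bool = False  # set when the request calls for a background subroutine


def route(percept, agent_speaking):
    """Pick one action per checkpoint."""
    if percept.user_speaking and agent_speaking:
        # Overlapping speech: treat short overlaps as the user's own
        # backchannel ("mm-hmm") and longer ones as a real interruption.
        return "yield" if percept.speech_ms >= 400 else "continue"
    if percept.user_speaking:
        # The user holds the floor: acknowledge long stretches, else just listen.
        return "backchannel" if percept.speech_ms >= 2000 else "listen"
    if percept.needs_tool:
        return "invoke_subroutine"
    return "continue"


action = route(Percept(user_speaking=True, speech_ms=600), agent_speaking=True)
```

Even this toy version shows where the difficulty sits: the thresholds decide whether the agent talks over people or goes quiet too easily, which is the "knowing when not to talk" problem raised elsewhere in the thread.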

Jai Gulati@jaigulati_

@a16z hi @miramurati my friend @adilmania wants to talk to you

Kerem Ozkan@keremozkan

@a16z This is going to be especially important for robotics. The AI reasoning layer in robotics stacks is also very turn-like in how it interacts with the physical world.

GetOffMyDickerson@_kaboosky

@a16z This is impossible btw.

Crazy A16Z has people this dumb 😭

She thinks you can just have an endless context window.

Earth doesn’t have enough energy to supply the compute for that shit

Nicolas Granatino🌻@ngranati

@a16z Looks like France didn't just give French Theory to the world...

Paris-based @kyutai also gave Moshi back in 2024.

https://github.com/kyutai-labs/moshi

podEssence@podEssence_app

@miramurati I respectfully disagree. We already have enough trouble with people who don’t truly listen. Let AI be the one that does — patient, always-on, never waiting for its turn.

Sometimes the “rigid” turn-based approach is actually a strength when human attention spans are so short.
