
Mira Murati, Thinking Machines Lab co-founder, proposes continuous AI models that process new inputs while generating outputs

Dumitru Erhan described the concept as "listening-while-thinking" machines

a16z @a16z

Mira Murati says human-AI collaboration needs models that can listen while they think:

"The types of models that we work with today, they're very turn-based. You talk, they talk, then they go off and think."

"While they're thinking, it's almost like they're deaf and blind. They cannot perceive anything else about what's going on."

"By contrast, our interactions with each other are very rich. There is a lot of information in our interactions when we are silent, when we're thinking, when we're interrupting one another."

"Interaction models are able to capture all of this nuance. They're not turn-based. They're more like time-based interaction, where they're continuously taking in audio, text, video, and continuously providing output."

"This enables you to catch things like interruptions and simultaneous speech, and really create a rich, high bandwidth interaction between humans and machines."

@miramurati at Bloomberg Tech live with @emilychangtv

2:27 PM · Jun 5, 2026 · 531.3K Views
Sentiment

Positive commenters praise Mira Murati's vision of time-based AI models for enabling more natural, continuous human-AI collaboration, while negative commenters dismiss the ideas as impossible or attack her personally, claiming she misunderstands the technology.

Pos 37.5% · Neg 62.5% · 16 comments with sentiment.

Listening-While-Thinking Machines

Axi @Xxi5olc

@a16z I have seen many of her interviews and still don’t know if she actually knows AI or not

14h · Views 1.3K · Likes 17
RG @rgvrmdya

@a16z Hey @miramurati, check out @reppo

That’s exactly how we approach the self-improvement loop using prediction markets for data!

1d · Views 1.1K · Likes 12 · Bookmarks 1
Omar @kouhxp

@a16z I replicated it with a CPU laptop and a budget of $0.01 one week after it was announced

1d · Views 1.1K · Likes 1 · Bookmarks 1

@Xxi5olc @a16z She was the CTO for OpenAI, so this is a funny comment.

4h · Views 111 · Likes 3 · Bookmarks 1
Aaliya @aaliya_va

@a16z Human communication is continuous. Technology is slowly moving in that direction.

1d · Views 270 · Likes 2 · Bookmarks 1
BOB CHEN @Bobchenjingbo

@a16z turn-based is the deepest constraint nobody names. every agent today makes you finish your thought before it starts its own.

the unlock isn't a bigger model — it's interruptibility. an agent you can talk over, that adjusts mid-stream, feels alive in a way waiting never will.

20h · Views 391 · Likes 1 · Bookmarks 1
cryptoverse420💎 @cryptoverse420

@Xxi5olc @drgurner @a16z Podcast interviews ≠ Job interviews. Don’t get it twisted 🤗

3h · Views 121 · Likes 5
Axi @Xxi5olc

@cryptoverse420 @drgurner @a16z You can easily sense immense technical depth from Feifei Li’s podcast interview.

2h · Views 100 · Likes 1 · Bookmarks 1
Samar Singh @samarknowsit

@a16z She is absolutely wrong, if she still means transformers.

23h · Views 732 · Likes 3
Turk 🇺🇸 @avdepotx

@a16z This seems obvious

1d · Views 184 · Bookmarks 1
Axi @Xxi5olc

@drgurner @a16z We all knew her former position. But can you actually tell any technical capability from her interviews?

3h · Views 211 · Likes 3
Mr. House @USArmyPhoenix

That’s easy, you just need the thinking to run subroutines.

Think of it like running two parallel models simultaneously that are synced.

What can cause the two to intersect or switch from one to the other?

It has to be an offset delay from input to reasoning. This creates checkpoints, or restore points, that it recursively resets to while running the subroutines. That way it updates in real time and stacks up learning layers through progressive reiteration.
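
One way to picture the two synced models is as two cooperative tasks that share a queue. The sketch below is only an illustration under that assumption: `listener`, `thinker`, `bus`, and `run_streams` are hypothetical names, and joining strings stands in for any real reasoning.

```python
import asyncio


async def listener(chunks, bus):
    """Push each incoming chunk onto the shared bus as it arrives."""
    for chunk in chunks:
        await bus.put(chunk)
        await asyncio.sleep(0)  # hand control to the other task
    await bus.put(None)  # sentinel: the input stream has closed


async def thinker(bus, log):
    """Revise a running interpretation every time a new chunk shows up."""
    draft = []
    while True:
        chunk = await bus.get()
        if chunk is None:
            break
        draft.append(chunk)
        log.append(" ".join(draft))  # placeholder for a reasoning update


async def _main(chunks):
    bus = asyncio.Queue()
    log = []
    # Both tasks run concurrently; neither waits for the other to finish a "turn".
    await asyncio.gather(listener(chunks, bus), thinker(bus, log))
    return log


def run_streams(chunks):
    return asyncio.run(_main(chunks))


print(run_streams(["book", "a", "flight"]))
```

The thinker never sees a complete user turn here. It only sees the stream so far and updates its draft after each chunk.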

1d · Views 96
Mr. House @USArmyPhoenix

Technical Breakdown: Parallel Perception + Reasoning Architecture for “Listen While Thinking”

This is the core shift from turn-based (current LLMs) to time-based, full-duplex interaction models.

1. The Fundamental Limitation in Deployed Models Today

Standard transformer-based LLMs (GPT-4o, Claude, Gemini, Grok, etc.) are autoregressive and blocking during generation:

•Input is processed → KV cache is built → model generates tokens one-by-one.

•Once generation starts, the model is effectively deaf and blind to new external input until it finishes the full response (or an external system forcibly stops it).

•Voice modes today mostly use cascaded pipelines:

◦Separate ASR (e.g., Whisper or streaming STT)

◦LLM

◦TTS

◦Heuristics/VAD (voice activity detection) for “barge-in”

•Result: Awkward pauses, poor overlap handling, weak backchanneling (“mm-hmm”, nods), and brittle interruption logic. The core model itself does not continuously ingest new audio/video while reasoning.

Even advanced real-time voice systems still treat the LLM as a mostly sequential “think then speak” component with external scaffolding.
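
The blocking behavior described in this section can be mimicked with a toy loop. This is a simulation under stated assumptions, not how any named product works: `turn_based_reply` is a hypothetical function, a list stands in for the KV cache, and `tokN` strings stand in for decoded tokens.

```python
from collections import deque


def turn_based_reply(prompt_tokens, incoming, max_tokens=5, barge_in=False):
    """Simulate a blocking, turn-based generation loop.

    prompt_tokens: tokens of the finished user turn.
    incoming: dict mapping a generation step to audio that arrives at that step.
    barge_in: if True, an external VAD-style check aborts generation when new
              audio arrives. The model itself still never reads that audio.
    """
    context = list(prompt_tokens)  # stands in for the KV cache
    pending = deque()              # audio that arrived mid-generation
    output = []
    for step in range(max_tokens):
        if step in incoming:
            pending.append(incoming[step])  # buffered, not attended to
            if barge_in:
                break                       # external scaffold cuts the reply
        token = f"tok{step}"                # placeholder for one decoded token
        output.append(token)
        context.append(token)
    return output, list(pending)


print(turn_based_reply(["hi"], {2: "wait"}))
print(turn_based_reply(["hi"], {2: "wait"}, barge_in=True))
```

In both calls the word "wait" ends up in `pending` rather than in `context`. The only difference barge-in makes is that the reply gets cut short.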

2. The Proposed Architecture: Parallel Streams with Offset Synchronization

High-level design (very close to what Thinking Machines Lab is implementing with their Interaction Models):

•Perception / Listener Stream (fast, continuous): Runs at high frequency. Encodes incoming audio (and video/text) into embeddings in real time. Handles prosody, tone, pauses, interruptions, and low-level semantics.

•Reasoning / Generation Stream (deeper): The “thinker” that produces coherent responses, plans, and generates output.

•Optional Background / Heavy Subroutine (asynchronous): Runs in parallel for tool use, search, complex reasoning, or long-horizon planning. Shares the full conversation context.

These are loosely coupled but synchronized through a shared rolling state (KV cache or equivalent latent memory).

Key innovation: The system is time-based and multi-stream rather than turn-based. It processes everything in small, time-aligned micro-turns (e.g., 200ms chunks of input + output simultaneously).
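
A minimal sketch of the micro-turn idea, assuming fixed 200 ms slices and a bounded deque as the shared rolling state. `run_micro_turns` and `toy_policy` are hypothetical names, and the policy is a trivial rule rather than a model.

```python
from collections import deque

CHUNK_MS = 200  # micro-turn length; 200 ms is the example figure from the post


def run_micro_turns(input_chunks, respond, window=8):
    """Interleave input and output in fixed, time-aligned slices.

    input_chunks: one entry per micro-turn (None means silence).
    respond: policy mapping the shared state to an output chunk, or None.
    window: how many recent events the rolling state keeps.
    """
    state = deque(maxlen=window)  # shared rolling context
    timeline = []
    for i, chunk in enumerate(input_chunks):
        t_ms = i * CHUNK_MS
        if chunk is not None:
            state.append(("in", t_ms, chunk))  # listener writes
        out = respond(state)
        if out is not None:
            state.append(("out", t_ms, out))   # speaker writes to the same state
        timeline.append((t_ms, chunk, out))
    return timeline


def toy_policy(state):
    """Acknowledge fresh input; otherwise stay silent."""
    if state and state[-1][0] == "in":
        return "ack:" + state[-1][2]
    return None


for row in run_micro_turns(["hello", None, "wait"], toy_policy):
    print(row)
```

Every slice carries both an input slot and an output slot, so there is no point at which the loop stops listening in order to speak.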

3. Detailed Mechanics — How It Actually Works

1. Continuous Perception Stream

◦Audio/video arrives in real time.

◦Lightweight encoder (co-trained from scratch, not a frozen heavy model like Whisper) produces embeddings + features (tone, hesitation, visual cues if video is present).

◦This stream never stops. It continuously writes to the shared context.

2. Offset Delay / Pipeline Lag (Your “offset delay from input to reasoning”)

◦The reasoning stream operates on a slightly delayed view of the input (e.g., 100–400ms lag).

◦This creates stability. The thinker isn’t reacting to every single 20ms audio packet chaotically.

◦The small offset acts as a natural buffer — similar to how humans have a slight processing delay between hearing and deeply responding.

3. Micro-Turn Chunking & Checkpoints (Your “checkpoints or restore points”)

◦Instead of waiting for a full user turn, the system slices time into tiny aligned windows (e.g., 200ms).

◦At each boundary (or on detected events like end-of-pause, prosody shift, or new speech), it creates a checkpoint:

▪Save current KV cache / hidden state of the reasoner.

▪Merge any new perceptual embeddings from the listener.

◦This allows partial rollback or steering without restarting the entire generation from scratch.
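
The offset delay and the checkpoint/rollback steps above can be sketched as two small classes. This is an illustration under assumptions: `DelayedView` and `Reasoner` are hypothetical names, the lag is counted in micro-turns rather than milliseconds, and a plain list stands in for the KV cache or hidden state.

```python
import copy


class DelayedView:
    """Expose input chunks to the reasoner only after `lag` micro-turns."""

    def __init__(self, lag):
        self.lag = lag
        self.chunks = []

    def push(self, chunk):
        self.chunks.append(chunk)

    def visible(self):
        # The newest `lag` chunks are withheld as a stabilizing buffer.
        end = max(0, len(self.chunks) - self.lag)
        return self.chunks[:end]


class Reasoner:
    """Holds a state that can be saved and restored at micro-turn boundaries."""

    def __init__(self):
        self.state = []        # stands in for the KV cache / hidden state
        self.checkpoints = []

    def checkpoint(self):
        self.checkpoints.append(copy.deepcopy(self.state))
        return len(self.checkpoints) - 1

    def extend(self, items):
        self.state.extend(items)

    def rollback(self, index):
        # Restore the saved state and drop any later checkpoints.
        self.state = copy.deepcopy(self.checkpoints[index])
        del self.checkpoints[index + 1:]


view = DelayedView(lag=2)
for chunk in ["a", "b", "c"]:
    view.push(chunk)

reasoner = Reasoner()
reasoner.extend(["plan"])
saved = reasoner.checkpoint()
reasoner.extend(["draft-reply"])     # speculative continuation
reasoner.rollback(saved)             # user interrupted: discard the draft
reasoner.extend(["user-interrupt"])  # merge the new perceptual input
print(view.visible(), reasoner.state)
```

Rolling back to `saved` keeps the earlier plan and discards only the speculative draft, which is the partial-rollback behavior the last bullet describes.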

1d · Views 29
Mr. House @USArmyPhoenix

Yes — here is the full deep dive covering the diagram, pseudocode, training the projector, routing logic, and plugin ecosystem for agentic systems.

This completes the architecture we’ve been building.

1. Detailed Architecture Diagram

This is the complete visual architecture we’ve been designing together.

Quick Legend / Key Highlights from the Diagram

• Top: Continuous multimodal inputs feed the fast Streaming Perception Encoder.

• Center: The Interpreter / Translator Layer is the heart of seamless integration — containing the Projector/Adapter, Cross-Attention Fusion, and Dynamic Router/Orchestrator.

• Middle horizontal bar: The Shared Rolling Context Bus with micro-turn checkpoints and offset delay buffering.

• Right side: Background Subroutines / Pluggable Agents connected via the interpreter (this is where your plugin ecosystem lives).

• Bottom: Streaming outputs with feedback loops for reconciliation and steering.

This single diagram captures:

• The parallel perception + reasoning streams

• The offset delay + checkpoint mechanism

• The central translator/interpreter layer

• Full plugin/modular extensibility for agentic systems

• The full-duplex, time-based flow that enables true “listen while thinking”

1d · Views 12
Dev Shah @devlikesbizness

The "deaf and blind while thinking" line is the one that made me pause.

I don't think that the continuous audio/video part is the hard bit though.

The hard bit is knowing when not to talk. Half of a good conversation is reading the room and staying quiet.

Genuinely curious whether this lands as fluid or just turns into two voices talking over each other.

14h · Views 382
Mr. House @USArmyPhoenix

For models to integrate seamlessly, you need a plug-in, something like a converter/translator/interpreter.

Yes — exactly.

For truly seamless integration between parallel streams (or heterogeneous subroutines/models), you need an explicit translation / interpretation / orchestration layer.

Without it, even well-designed parallel systems suffer from misalignment, latency spikes, or brittle handoffs.

This is the “glue” that makes the offset-delay + checkpoint + recursive update mechanism actually work reliably when connecting dissimilar components.

Why a Dedicated Translator/Interpreter Layer Is Required

Different streams or plugged-in models often speak different “languages”:

•Perception stream → raw, high-frequency, noisy embeddings (audio features, prosody, visual patches).

•Reasoning stream → higher-level, abstract, token-based or latent reasoning space.

•Background subroutine → possibly symbolic, tool-oriented, or long-horizon planning representations.

•External plugins (new experts, memory modules, other models) → completely different architectures or training distributions.

A naive shared KV cache helps, but a purpose-built interpreter layer handles the conversion, alignment, routing, and conflict resolution intelligently.

Forms the Translator/Interpreter Can Take

Here are the practical implementations, from lightweight to sophisticated:

1. Lightweight Projector / Adapter Layer (most common starting point)

◦Small neural network (MLP, cross-attention block, or LoRA-style adapters) that maps embeddings from one stream into the representation space of another.

◦Trained contrastively or with reconstruction loss so perceptual features become “reasoner-native.”

◦Very low overhead — can run at micro-turn frequency.

2. Cross-Attention Interpreter / Fusion Module

◦The reasoner (or a dedicated small interpreter model) uses cross-attention to dynamically “read” the listener’s state at each checkpoint.

◦This is how many modern multimodal models fuse modalities without forcing everything into one rigid space.

◦Allows the reasoner to selectively attend to new perceptual data rather than blindly ingesting everything.

3. Orchestrator / Router (Meta-Controller)

◦A lightweight model or hybrid rule+learned system that decides:

▪When to trigger reconciliation at a checkpoint.

▪Which parts of the new input are relevant.

▪Whether to steer generation, backchannel, yield, or invoke a background subroutine.

◦Acts like a traffic controller or “interpreter” of intent across streams.

4. Plugin / Tool-Calling Interface (for external subroutines)

◦Standardized schema + adapter (think evolved function calling, but streaming and stateful).

◦Any new “plug-in” (specialized vision expert, symbolic solver, external API wrapper, memory module) registers an interface.

◦The interpreter translates between the core interaction model’s context and the plugin’s expected format — and translates results back.

◦Enables true modularity: you can hot-swap or add capabilities without retraining the whole system.

5. Shared Latent Bus + Dynamic Alignment (more advanced)

◦A continuously updated common latent space that all streams write to and read from.

◦Alignment happens via ongoing contrastive or predictive objectives during training/inference.

◦Closest to a true “universal interpreter.”
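
Two of the forms listed above, the router and the plugin interface, are simple enough to sketch in plain Python. Everything here is an assumption for illustration: `Event`, `route`, and `PluginRegistry` are hypothetical names, the routing rules are hand-written rather than learned, and the request/response dictionaries are an invented schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Event:
    user_speaking: bool
    new_info: bool = False      # did the listener detect relevant new content?
    tool: Optional[str] = None  # name of a plugin the context calls for, if any


def route(event: Event, model_speaking: bool) -> str:
    """Rule-based stand-in for the orchestrator's decision at a checkpoint."""
    if event.user_speaking and model_speaking:
        return "yield"        # overlap: stop talking
    if event.tool is not None:
        return "invoke"       # hand off to a background subroutine
    if event.new_info:
        return "steer"        # adjust the ongoing generation
    if event.user_speaking:
        return "backchannel"  # brief acknowledgment while listening
    return "continue"


class PluginRegistry:
    """Maps plugin names to handlers and translates context both ways."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Callable[[dict], dict]] = {}

    def register(self, name: str, handler: Callable[[dict], dict]) -> None:
        self._plugins[name] = handler

    def call(self, name: str, context: dict) -> dict:
        request = {"query": context["last_text"]}  # core context -> plugin format
        result = self._plugins[name](request)
        return {**context, "plugin_result": result["answer"]}  # and back again


registry = PluginRegistry()
registry.register("shout", lambda req: {"answer": req["query"].upper()})
print(route(Event(user_speaking=True), model_speaking=True))
print(registry.call("shout", {"last_text": "weather"}))
```

Because plugins only ever see the `request` dictionary, a handler can be swapped without touching the routing rules, which is the modularity point made in item 4.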

1d · Views 11
Jai Gulati @jaigulati_

@a16z hi @miramurati my friend @adilmania wants to talk to you

1d · Views 105 · Likes 1

@drgurner @Xxi5olc @a16z Breaking free from gender stereotypes is tough for some people, particularly when the founder is a talented, beautiful young woman.

2h · Views 32 · Likes 2