Jiatao Gu releases NF-CoT, using normalizing flows for continuous latent LLM reasoning instead of discrete text tokens

The system supports GRPO training and KV-cache decoding.

Original post

Jiatao Gu@thoma_gu

🤔Can LLMs reason by sampling continuous thoughts — not just tokens?

Introducing NF-CoT: Latent Reasoning with Normalizing Flows. It samples continuous chain-of-thoughts directly in the stream of LLM with exact likelihood -- powered by STARFlow.

🌐Page: http://nf-cot.vercel.app

1:12 PM · Jun 8, 2026 · 5.5K Views

Murray Kang@haoqik322

Excited to share our follow-up work to LaDiR: Latent reasoning with Normalizing Flows (NF-CoT)!

Instead of iterative diffusion denoising, NF-CoT integrates STARFlow to generate continuous thoughts autoregressively in LLMs, like tokens — with exact likelihood, KV-cache-friendly decoding, and compatibility with policy-gradient RL training such as GRPO.

Jiatao Gu@thoma_gu

This work was led by two amazing @PennEngineers master’s students, @Guancheng_Tu @EthanFu0355525.

Also huge thanks to our great collaborators @SuhaoYu1020, @tyao923, @haoqik322, @Lianhuiq, and @YizheZhangNLP!

More details: http://arxiv.org/abs/2606.06447
Code & Model: coming soon.

Jiatao Gu@thoma_gu

NF-CoT is also cheaper than LaDiR.

Inference: 1.9× faster end-to-end, with 2.5× fewer FLOPs/sample.
Training: 2.85× higher sample throughput, with 6.66× fewer FLOPs.

-> Latent thoughts are generated like tokens — autoregressively, cache-friendly, without iterative denoising.

Jiatao Gu@thoma_gu

🧐Why is this hard?

Simply moving CoT beyond tokens is not enough. A useful latent reasoning space should still keep what makes token CoT powerful:
• sampling diverse trajectories
• scoring probabilistically
• training with likelihood
• decoding efficiently with KV caches

Murray Kang@haoqik322

Excited to share our follow-up work to LaDiR: Latent reasoning with Normalizing Flows (NF-CoT)!

Instead of iterative diffusion denoising, NF-CoT uses normalizing flows to generate continuous thoughts autoregressively, like tokens — with exact likelihood, KV-cache-friendly decoding, and compatibility with policy-gradient RL training such as GRPO.

Jiatao Gu@thoma_gu

Prior latent-CoT methods make trade-offs. For example:
• Coconut-style feedback is efficient, but mostly deterministic.
• LaDiR-style diffusion latents are stochastic, but iterative and likelihood-intractable.

NF-CoT keeps the sweet spot -- stochasticity + likelihood + efficiency.

Jiatao Gu@thoma_gu

Results on Qwen3-8B-Base across 5 code benchmarks:

Avg pass@1 improves from 55.8 → 68.8 (+13.0), and reaches 70.1 after RL.

NF-CoT also outperforms the strongest latent baseline, LaDiR, by +7.1%. On MBPP+, pass@1 = 72.1 — matching the base model’s pass@128.
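For readers unfamiliar with the pass@k numbers quoted above, here is a minimal sketch of the commonly used unbiased pass@k estimator. The posts do not say how pass@1 or pass@128 were computed, so treating this as the evaluation protocol is an assumption; the function name `pass_at_k` and its arguments are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled programs, c of which pass the tests.

    Returns the probability that at least one of k samples drawn without
    replacement from the n generated programs is correct.
    """
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n=10` samples and `c=3` passing, `pass_at_k(10, 3, 1)` is about 0.3. A pass@1 that matches a base model's pass@128 means a single sample succeeds as often as the base model does with 128 attempts.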

Jiatao Gu@thoma_gu

💡The idea: place a STARFlow inside the LLM!

STARFlow is a SOTA normalizing flow built from Deep-Shallow Autoregressive Transformers.

In NF-CoT, shallow invertible layers map continuous latents into a deep reasoning space, where the LLM models them left-to-right alongside text.
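To make the "exact likelihood" property concrete, here is a toy affine autoregressive flow in NumPy. It is not STARFlow and not the NF-CoT architecture: the single linear conditioner, the parameters `w` and `b`, and all function names are assumptions chosen for illustration. It only shows why an invertible, causally conditioned map gives a closed-form log-likelihood and step-by-step sampling.

```python
import numpy as np

def forward(z, w, b):
    """Map latents z of shape (T, D) to noise u; return u and log|det J|.

    Step t is shifted and scaled using only step t-1, so the Jacobian is
    triangular and its log-determinant is a plain sum of log-scales.
    """
    prev = np.vstack([np.zeros((1, z.shape[1])), z[:-1]])  # causal context
    mu = prev @ w + b               # shift predicted from the previous latent
    log_sigma = np.tanh(prev @ w)   # bounded log-scale for numerical stability
    u = (z - mu) * np.exp(-log_sigma)
    return u, -log_sigma.sum()

def inverse(u, w, b):
    """Sample path: rebuild z from noise u one step at a time."""
    z = np.zeros_like(u)
    prev = np.zeros(u.shape[1])
    for t in range(u.shape[0]):
        mu = prev @ w + b
        log_sigma = np.tanh(prev @ w)
        z[t] = u[t] * np.exp(log_sigma) + mu
        prev = z[t]
    return z

def log_likelihood(z, w, b):
    """Exact log p(z) via change of variables under a standard normal prior."""
    u, log_det = forward(z, w, b)
    log_prior = -0.5 * (u ** 2).sum() - 0.5 * u.size * np.log(2 * np.pi)
    return log_prior + log_det
```

Sampling draws `u` from a standard normal and calls `inverse`, which proceeds left to right like token decoding, while `log_likelihood` scores any latent sequence in one pass.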

Jiatao Gu@thoma_gu

The whole model is trained end-to-end with NLL over both latent thoughts and text answers.
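One way such a joint objective could be written, as a sketch rather than the paper's exact formulation. The symbols are assumptions: $x$ is the prompt, $z_{1:T}$ the latent thoughts, $y_{1:M}$ the answer tokens, and $f_\theta$ an invertible map from each latent to a noise variable $u_t$ with a standard normal prior.

```latex
\mathcal{L}(\theta)
  = -\sum_{t=1}^{T} \log p_\theta(z_t \mid x, z_{<t})
    \;-\; \sum_{m=1}^{M} \log p_\theta(y_m \mid x, z_{1:T}, y_{<m}),
\qquad
\log p_\theta(z_t \mid x, z_{<t})
  = \log \mathcal{N}(u_t;\, 0, I)
    + \log \left| \det \frac{\partial u_t}{\partial z_t} \right|,
\quad u_t = f_\theta(z_t;\, x, z_{<t}).
```

The first sum is the flow's change-of-variables likelihood for the continuous thoughts, and the second is the usual next-token cross-entropy for the text answer.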

At inference time, it reasons autoregressively in the deep reasoning space, while still allowing latent thoughts to be inspected or decoded for better explainability.

Jiatao Gu@thoma_gu

Moreover, the exact likelihood over both latent reasoning and text answers makes NF-CoT compatible with GRPO-style post-training with verifiable rewards — like explicit CoT, but in the continuous space and pluggable into existing RL frameworks.
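A minimal sketch of why an exact sequence log-likelihood is enough for GRPO-style training. It shows only the group-relative advantage and a plain policy-gradient surrogate; the clipped probability ratio and KL penalty used in full GRPO are omitted, and the function names and the `seq_log_probs` input (latent log-likelihood plus answer-token log-probabilities per sample) are assumptions, not the authors' code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def policy_gradient_loss(seq_log_probs, rewards):
    """Surrogate loss: raise the log-likelihood of above-average samples.

    seq_log_probs holds one total log-probability per sampled trajectory,
    covering both the latent thoughts and the text answer.
    """
    adv = group_relative_advantages(rewards)
    return -(adv * np.asarray(seq_log_probs, dtype=float)).mean()
```

With verifiable rewards such as unit-test pass or fail, `rewards` is a 0/1 vector per prompt. A group where every sample gets the same reward yields zero advantages and therefore no gradient signal.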
