
Fable AI Wins 4 of 5 NVIDIA GPU Kernel Optimization Challenges


Original post

Edward Z. Yang#811

Doğaç@dogacel0

I let Fable optimize GPU kernels autonomously using the "auto-gpu-kernel" harness. If it had joined NVIDIA's competition today, it would have won 🥇 in 4/5 kernels against humans.

Fable can write Gluon kernels, do warp specialization, use TMA, tcgen05, etc.

(Speedup vs Opus 4.8)

5:27 AM · Jun 10, 2026 · 8.1K Views


Posts from X


maharshi@maharshii

@dogacel0 I wonder how good Fable is with the CuTe DSL; it might be able to go more low-level there…


Doğaç@dogacel0

The biggest difference I saw between Opus 4.8 and Fable is that Opus can't generate Gluon kernels that are faster than Triton, because it fails to handle the complexity of warp specialization and other techniques. Since Fable can use a lower-level DSL consistently, it can write faster kernels.


Doğaç@dogacel0

Unlike Opus, Fable both converges faster and decides to stop earlier. Deciding to stop early might be a safety limit.

The MoE path had more known optimizations, and I've heard people have discovered them by prompting agents continuously.


Doğaç@dogacel0

@maharshii Exactly, my harness was designed with Opus in mind. I think right now CuTe might be the right abstraction. I will test that when I have time.


Doğaç@dogacel0

Also, Fable spawns many sub-agents, uses git worktree on its own, and overall converges faster. For reference, the Opus 4.8 kernels took around 12 hours to finish, whereas Fable finished in under 8 hours while spending significantly fewer GPU credits.
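
For readers unfamiliar with the git worktree pattern mentioned here, a minimal sketch of how parallel sub-agents could each get an isolated checkout. This is an illustration only, not the harness's or Fable's actual code: the function names, the `agent/<task>` branch naming, and the sibling-directory layout are all assumptions. Only the `git worktree add -b <branch> <path>` command itself is standard git.

```python
from pathlib import Path
import subprocess


def worktree_add_cmd(repo: Path, task: str) -> list[str]:
    """Build the git command that creates one isolated checkout for a task.

    Hypothetical layout: the worktree lives next to the repo as
    "<repo>-<task>" on a new branch "agent/<task>".
    """
    worktree_path = repo.parent / f"{repo.name}-{task}"
    return [
        "git", "-C", str(repo),
        "worktree", "add",
        "-b", f"agent/{task}",
        str(worktree_path),
    ]


def create_worktrees(repo: Path, tasks: list[str], run: bool = False) -> list[list[str]]:
    """Return one command per task; execute them only when run=True."""
    cmds = [worktree_add_cmd(repo, task) for task in tasks]
    if run:
        for cmd in cmds:
            # check=True raises CalledProcessError if git fails
            subprocess.run(cmd, check=True)
    return cmds
```

Each sub-agent can then edit and benchmark its own kernel variant in its own directory without touching the others' working trees; calling `create_worktrees(Path("."), ["moe", "gemm"], run=True)` would create two such checkouts (the task names are placeholders).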


Doğaç@dogacel0

The harness and generated kernels are available here 👇🏻

https://github.com/Dogacel/auto-gpu-kernel


maharshi@maharshii

@dogacel0 Yes, considering that one can go down to the level of NVVM and PTX if needed in the CuTe DSL.
