WebGPU kernels generated by AI agent Fable 5 are now public, accelerating Gemma 4 to 255 tokens per second in-browser · Digg

Posts from X

Views 9.8K · Bookmarks 41 · Likes 68 · Retweets 8 · Replies 4

Omar Sanseviero@osanseviero

Gemma 4 with 255 tokens per second running directly in your browser

Xenova@xenovacom

Before Fable 5 was shut down, it pushed Gemma 4 to 255 tok/s on WebGPU. Some didn't believe it was real.

Today we're releasing the demo and kernels it wrote for you to see yourself. Run it locally in your browser.

Agentic kernel optimization is the future of on-device inference

👩‍💻 Paige Bailey@DynamicWebPaige

This speed is insane. 🤯🐧

And especially wild when you realize that it's running completely sandboxed in the browser, all data kept local:

Xenova@xenovacom

Before Fable 5 was shut down, it pushed Gemma 4 to 255 tok/s on WebGPU. Some didn't believe it was real.

Today we're releasing the demo and kernels it wrote for you to see yourself. Run it locally in your browser.

Agentic kernel optimization is the future of on-device inference

Xenova@xenovacom

In case you hadn't noticed, we're working on something big. Stay tuned.

🔗 Link to the demo: https://huggingface.co/spaces/webml-community/gemma-4-webgpu-kernels

Omar Sanseviero@osanseviero

DiffusionGemma at 2000 tokens per second with 18GB RAM

Unsloth AI@UnslothAI

DiffusionGemma can now run at 2000+ tokens/sec! ⚡

We made local DiffusionGemma inference 1.8× faster.

Run it on 18GB RAM via Unsloth Studio.

GitHub: https://github.com/unslothai/unsloth Guide: https://unsloth.ai/docs/models/diffusiongemma

Steve💙🇨🇦@xyster

@xenovacom I used GPT 5.5 to take 4x Intel B70s running Minimax m2.7 from 14 to over 100 tok/s (decode rate).

It took two weeks of 24/7 auto research however. That's a lot of tokens!!

Fable and GPT Pro can do it much faster.

Even GLM 5.2 can do it I'm finding though; just slowly.

The Singularity Project@01Singularity01

@xenovacom Failed to load: No supported WebGPU variant for com.xenova.gemma4.DecodeOprojNorm; rejected fused_rows: when guard resolved to false; fused: when guard resolved to false
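The message suggests that each kernel ships with several variants, each gated by a `when` guard, and that loading fails when every guard evaluates to false. How the demo actually selects variants is not shown in this thread. The sketch below is only a hypothetical model of that pattern: `Variant`, `select_variant`, and the `subgroups` and `shader_f16` capability names are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence

# Hypothetical capability map, e.g. features reported by the GPU adapter.
Capabilities = Mapping[str, bool]


@dataclass(frozen=True)
class Variant:
    """One candidate implementation of a kernel (hypothetical structure)."""
    name: str
    when: Callable[[Capabilities], bool]  # guard: True if the variant can run


def select_variant(op: str, variants: Sequence[Variant], caps: Capabilities) -> Variant:
    """Return the first variant whose guard passes; otherwise report every rejection."""
    rejected = []
    for variant in variants:
        if variant.when(caps):
            return variant
        rejected.append(f"{variant.name}: when guard resolved to false")
    raise RuntimeError(f"No supported variant for {op}; rejected " + "; ".join(rejected))


# Assumed guards: which capability each variant needs is a guess, not from the demo.
variants = [
    Variant("fused_rows", lambda caps: caps.get("subgroups", False)),
    Variant("fused", lambda caps: caps.get("shader_f16", False)),
]

try:
    select_variant("DecodeOprojNorm", variants, {"subgroups": False, "shader_f16": False})
except RuntimeError as err:
    message = str(err)  # lists both rejected variants, like the error above
```

Under this model, an error of this shape would mean the device met none of the guards, not that the kernel itself is broken.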

Samian@ApplyWiseAi

@xenovacom 255 tok/s on gemma 4 in browser is wild if it holds up. what's the model size they're running and is this with prefill + decode or just decode
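The decode-versus-end-to-end distinction raised here can change the headline number considerably. A minimal sketch of how the two figures differ, using made-up timings chosen only for illustration (they are not measurements from the demo):

```python
def decode_tokens_per_second(generated_tokens: int, decode_seconds: float) -> float:
    """Throughput over the decode phase only; prompt processing is excluded."""
    if decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    return generated_tokens / decode_seconds


def end_to_end_tokens_per_second(
    generated_tokens: int, prefill_seconds: float, decode_seconds: float
) -> float:
    """Throughput over the whole request, including time spent on prefill."""
    total_seconds = prefill_seconds + decode_seconds
    if total_seconds <= 0:
        raise ValueError("total time must be positive")
    return generated_tokens / total_seconds


# Hypothetical timings, not taken from the thread.
generated = 510   # tokens produced
prefill_s = 0.5   # seconds spent processing the prompt
decode_s = 2.0    # seconds spent generating tokens

decode_rate = decode_tokens_per_second(generated, decode_s)
e2e_rate = end_to_end_tokens_per_second(generated, prefill_s, decode_s)
print(f"decode only: {decode_rate:.0f} tok/s, end to end: {e2e_rate:.0f} tok/s")
```

With these assumed timings the decode-only rate is 255 tok/s while the end-to-end rate is 204 tok/s, which is why it matters which of the two a benchmark reports.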

Ian Danforth@iand_elicit

@xenovacom As far as I can tell it's fast and not very high quality. So interesting technical work, but I wouldn't use the model for anything.

xcaliburr@xscorpiox101

@xenovacom I'm much more interested in whether the output is still correct; quality normally deteriorates when speed increases

Jacob Van Wie@KeanuPianu

@DynamicWebPaige meh, I'm plenty busy sucking insane value out of grok build. As far as Fable 5 goes, you know what you call a sword that can't kill? A piece of shit. lol. Alignment is the cancer that stops all thought, because the human mind is not aligned, and cannot be, or we would be broken.

Jimsta@Jimster4801

@adnanthekhan @xenovacom What do you mean by the abuse risks that come with it?

Adnan Khan@adnanthekhan

@Jimster4801 @xenovacom If you just vibe an inference endpoint on your web app, threat actors can and will abuse it to run free inference.

You end up paying a premium for managed chatbot services or you have to build and harden your own.

Jacob Van Wie@KeanuPianu

@DynamicWebPaige Alignment is like saying, "we are making a perfect circle, but it can only be made out of right angles." Like, okay friend, someday you will see intelligence is as cold and direct as outer space is; you can't take that dimensionality out of it and still have it whole.

Jimsta@Jimster4801

@adnanthekhan @xenovacom Oh, you mean through exploits. Right, in this case it runs on the local machine, so there is nothing to exploit.

Unni@karmakomik

@xenovacom Will try but I hope it did not just optimise for your GPU 😅

Fab 🇧🇷🇨🇦@FlockonUS

@xenovacom How many GB will my browser load if I access the page?

ansuman ☄️@ansuman_bin

@xenovacom bro retired too early!

billyG88@billyG881

@xenovacom another inflection point

SpamWaveTV@SpamWaveTV

@xenovacom too bad you didn't do qwen since the SWA makes gemma useless for agentic tasks

Tomasz Paruszewski 🇵🇱@tomparuszewski

@FlockonUS @xenovacom 2
