Zaid Khan releases GPU Forecasters to predict CUDA and Triton kernel runtimes using LLMs as hardware surrogates

AK@_akhaliq

paper: https://huggingface.co/papers/2605.31464

Can an LLM act as a selective model of a GPU during evolutionary search, by reasoning + forecasting a kernel’s runtime but deferring to a GPU when unsure? We produced 12k kernels + runtimes from evolutionary search, costing 400M reasoning tokens + 600 GPU-hours to answer this.

In our work GPU Forecasters, we study language models as selective surrogates for GPU kernel optimization.

1️⃣ Off-the-shelf LLMs can forecast how a GPU responds to a candidate kernel with non-trivial accuracy. If we rank candidates by these predictions and measure only the top 10% on a GPU, the fastest kernel we find is within 20% of the best in the pool.

2️⃣ We want LLMs to not just be accurate but also calibrated, so that we can use their uncertainty for selective prediction: during search, we should trust only confident forecasts and verify less confident forecasts by sending them to the GPU.

3️⃣ We train an open-weights surrogate (GPT-OSS-20B) with RL to improve both accuracy and calibration. Calibration-shaped rewards improve both confidence reliability and ranking ability, while correctness rewards alone do not.

4️⃣ Inside a real kernel search, the surrogate finds faster kernels than an equal-GPU-budget baseline by considering more candidates per measurement.

5️⃣ We release 12,388 LLM-generated GPU kernels with measured runtimes spanning 118 operations, CUDA and Triton backends, 3 GPU types, taking 400M tokens + 600 GPU-hours to produce. This dataset can be used for analyzing LLM-driven evolutionary program search dynamics, post-training LLMs for kernel code generation, and things we didn’t get a chance to explore, like training reward models!

Thread 🧵👇

Jaemin Cho @ ICML 2026 🇰🇷@jmin__cho

Can LLMs predict GPU kernel runtimes instead of measuring them on actual hardware?

We find that:
- LLMs act as great selective surrogates (deferring to GPUs when unsure)
- RL improves LLM accuracy & calibration
- Kernel search becomes much more efficient

We're releasing 12K kernels + runtimes for the community to build on.

Great work led by Zaid! Check more details 🧵

Zaid Khan@codezakh

Appreciate the shoutout @_akhaliq for our work on "GPU Forecasters" exploring whether language models can act as selective surrogates for GPU kernel optimization! Details in our thread:

AK@_akhaliq

GPU Forecasters

Language Models as Selective Surrogates for Kernel Runtime Optimization

Charles 🎉 Frye @ AIEng World's Fair@charles_irl

new linter just dropped

AK@_akhaliq

GPU Forecasters

Language Models as Selective Surrogates for Kernel Runtime Optimization

Mohit Bansal@mohitban47

🚨 GPU Forecasters 👉 we explore if a reasoning model can be a selective world model of a GPU, forecasting a kernel's speed while deferring to real hardware when unsure, making kernel search more efficient.

Inside an evolutionary kernel search, the surrogate lets us explore many more candidates in imagination and run only the most promising on the GPU. We often find kernels as fast or faster using the same number of real GPU evaluations.

We also show that reinforcement learning with calibration rewards can teach the surrogate to know when it doesn't know, making it more reliable during search.

We see this as early work toward approximate world models of complex hardware-software systems!

🧵 👇

Justin Chih-Yao Chen@cyjustinchen

🚨LLMs are increasingly used to generate GPU kernels, but evaluating those kernels still requires expensive compilation and execution on real hardware.

Can LLMs act not just as kernel generators, but also as forecasters of kernel performance, deferring to hardware only when uncertain?

Introducing ✨GPU Forecasters✨, our new study of LLMs as selective surrogates for GPU kernel optimization across:
• 12,388 measured kernels across 118 operations
• CUDA + Triton backends & 3 GPU types
• 400M tokens + 600 GPU-hours

We find that:

1⃣ Off-the-shelf LLMs can predict relative kernel performance surprisingly well. Measuring only the top 10% of LLM-ranked candidates recovers kernels within 20% of the best available.

2⃣ Accuracy alone isn't enough. A useful surrogate must be calibrated, i.e., knowing when to trust its forecasts and when to defer to the GPU.

3⃣ Inside a real evolutionary kernel search, the surrogate evaluates far more candidates under the same GPU budget, leading to faster kernels than an equal-budget baseline.

More results, analysis, and released data in the thread 🧵👇

Elias Stengel-Eskin@EliasEskin

GPU kernels are the engines powering NNs, making their optimization a key to self-improving agents. But search over kernels is expensive because eval on hardware takes time.

We train calibrated surrogate models that forecast kernel speedups w/out execution. Calibration is key here as it lets us perform selective prediction, off-loading uncertain predictions to the GPU while trusting more certain ones.

We see this as a first step towards building world models for hardware-software systems!

Key findings:

▪️ We find that off-the-shelf models can perform forecasting and we show how we can use calibration losses to improve them

▪️ We also show how our selective surrogate models can be incorporated into real kernel searches, leading search to converge on faster kernels under the same budget and breaking out of stagnant searches

▪️ Along the way, we built up a sizeable dataset of >12k generated kernels with their runtimes. This is an important resource for future work in this area, and opens up a lot of interesting research directions in predicting kernel performance.

Check out the 🧵 and paper for more details! 👇

AK@_akhaliq

GPU Forecasters

Language Models as Selective Surrogates for Kernel Runtime Optimization

Zaid Khan@codezakh

Predicting an exact runtime from source alone is intractable. We instead predict how fast a candidate is relative to a reference kernel, sorted into eight logarithmically spaced bins. Logarithmic rather than linear bins make the target easier to predict, since the model only has to judge whether a kernel is much slower, near baseline, or much faster, which is easier than separating a 2x speedup from a 3x one.

Given the reference kernel, the candidate, and the target hardware, the model reasons about how the candidate will execute and returns a probability distribution over those bins. Its confidence is the probability it puts on the bin it predicts, and a search can defer the low-confidence cases to the GPU.

We then use RL to improve this distribution, rewarding the model both for predicting the correct bin and for reporting calibrated probabilities across all of them.
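
The binning and deferral logic described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bin edges, the 0.6 confidence threshold, and the function names `speedup_bin` and `forecast_or_defer` are all assumptions chosen for the example.

```python
import bisect

# Assumed edges: seven cut points give eight logarithmically spaced bins
# over speedup = reference_runtime / candidate_runtime.
BIN_EDGES = [0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0]

def speedup_bin(reference_ms: float, candidate_ms: float) -> int:
    """Map a measured runtime pair to a bin index in 0..7."""
    speedup = reference_ms / candidate_ms
    return bisect.bisect_right(BIN_EDGES, speedup)

def forecast_or_defer(probs, threshold=0.6):
    """Return (predicted_bin, confidence, defer_to_gpu).

    Confidence is the probability placed on the predicted (argmax) bin.
    The threshold is a hypothetical value, not one from the paper.
    """
    predicted = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[predicted]
    return predicted, confidence, confidence < threshold
```

With these assumed edges, a candidate that runs 2.5x faster than the reference lands in bin 5, and a forecast whose top bin holds only 0.5 of the probability mass would be sent to the GPU.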

Zaid Khan@codezakh

How well do off-the-shelf LLMs do at this, with no training? We rank candidates by predicted speedup and check how fast the best kernel in the top-ranked fraction is, relative to the fastest kernel in the whole pool.

For GPT-OSS-20B, measuring just the top 1% of the ranking finds a kernel within 30% of the fastest in the pool, and the top 50% gets within 6%. Gemini-3 Flash ranks best of the models we tested.

Calibration varies much more across models. A model can rank candidates well while reporting confidence that does not match its accuracy, and we use RL to improve this.
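
A rough sketch of the ranking evaluation described above: rank by predicted speedup, keep the top fraction, and compare the best measured kernel in that slice with the best in the whole pool. The function name and the definition of the gap (relative shortfall in measured speedup) are assumptions for illustration; the paper may define the metric differently.

```python
def gap_to_pool_best(predicted, measured, fraction):
    """Relative shortfall of the best kernel in the top-ranked fraction.

    predicted: surrogate scores (higher means predicted faster)
    measured:  measured speedups for the same candidates
    Returns 0.0 when the top fraction contains the pool's fastest kernel.
    """
    n = len(predicted)
    k = max(1, int(n * fraction))  # always measure at least one candidate
    order = sorted(range(n), key=lambda i: predicted[i], reverse=True)
    best_found = max(measured[i] for i in order[:k])
    return 1.0 - best_found / max(measured)

predicted = [0.9, 0.1, 0.5, 0.7]
measured = [2.0, 1.0, 4.0, 3.0]
print(gap_to_pool_best(predicted, measured, 0.25))  # 0.5
print(gap_to_pool_best(predicted, measured, 0.5))   # 0.25
```

In this toy pool the surrogate misranks the fastest kernel, so measuring more of the ranking shrinks the gap, which is the same shape as the 1% versus 50% comparison above.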

Zaid Khan@codezakh

How does RL change the model's predictions? The untrained GPT-OSS-20B tends to call slow kernels faster than they are and the very fastest kernels slower than they are.

RL redistributes the model's probability mass across the speedup bins. The confusion matrices show the base model's errors and how each reward (correctness or calibration) moves them. A perfect predictor would be dark only on the diagonal (green squares), meaning its predicted speedup always equals the true speedup.

There is a tradeoff. Training improves calibration but increases raw forecast error, and the reward chosen sets the balance between them. Essentially, the surrogate becomes less confident (as it should) after RL training for calibration.

Zaid Khan@codezakh

Where does a surrogate's training data come from? It is a byproduct of running search. Every measured candidate already carries the (reference, candidate, hardware, speedup) tuple a surrogate learns from, so a long-running search produces its own training set.

We release 12,388 LLM-generated GPU kernels with measured runtimes, spanning 118 problems, CUDA and Triton, three GPU types, and four search methods, at a cost of 400M tokens and 600 GPU-hours. Kernel search is computationally expensive, so this dataset can be re-used for analyzing LLM-driven evolutionary program search dynamics, post-training LLMs for kernel code generation, and things we didn’t get a chance to explore, like training reward models!
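
As an illustration of how a search can log its own training set, here is a small sketch of the (reference, candidate, hardware, speedup) tuple as a record. The class name, field names, and JSON-lines logging are hypothetical and are not the schema of the released dataset.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Measurement:
    # Hypothetical field names; the released data may be organized differently.
    reference_src: str
    candidate_src: str
    hardware: str
    speedup: float

def log_measurement(log, reference_src, candidate_src, hardware,
                    reference_ms, candidate_ms):
    """Append one measured candidate to the log as a JSON line."""
    record = Measurement(reference_src, candidate_src, hardware,
                         reference_ms / candidate_ms)
    log.append(json.dumps(asdict(record)))
    return record
```

Every GPU measurement the search already pays for becomes one such record, so no extra hardware time is needed to build the surrogate's training data.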

Zaid Khan@codezakh

Can we train a surrogate to know what it doesn’t know? For reliable use, a surrogate should be uncertain about the runtime of candidates it understands poorly and confident about those it understands well. We test this by grouping predictions by stated confidence and plotting forecast error within each group.

An ideal plot would be monotonically decreasing from left to right, meaning as the confidence goes up, the forecast error always goes down. Any deviation from monotonicity (if the line goes up, then down) means that the surrogate’s confidence was misaligned.

The off-the-shelf model almost has this property. Training on correctness alone removes this property, making the model confident on candidates it gets badly wrong, and removing a clear relationship between accuracy and confidence.

Adding a Brier calibration reward to the same correctness reward restores the property and improves on the base, so higher confidence reliably means lower error.
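
The confidence-versus-error check above can be sketched like this. The equal-width grouping, the choice of four groups, and the use of a generic per-prediction error value (for example, distance in bins between predicted and true speedup) are assumptions for illustration.

```python
def error_by_confidence(confidences, errors, n_groups=4):
    """Mean forecast error inside equal-width confidence groups over [0, 1].

    Returns one value per group, lowest confidence first; empty groups are None.
    """
    sums = [0.0] * n_groups
    counts = [0] * n_groups
    for conf, err in zip(confidences, errors):
        group = min(int(conf * n_groups), n_groups - 1)  # conf == 1.0 goes in the last group
        sums[group] += err
        counts[group] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

def is_monotone_decreasing(group_errors):
    """True if error never rises as confidence rises (empty groups are skipped)."""
    filled = [e for e in group_errors if e is not None]
    return all(a >= b for a, b in zip(filled, filled[1:]))
```

A surrogate whose grouped errors pass `is_monotone_decreasing` has the property described above: higher stated confidence never comes with higher average error.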

Zaid Khan@codezakh

Can the surrogate spot the rare mutations that make a kernel much faster than its parent? We call these discovery moments, the steps that break stagnation in a search.

We mine 1,347 parent-child pairs and use the surrogate to predict the large improvements. Its scores are predictive, staying above the random baseline at every threshold and reaching higher precision for larger improvements.

Precision still falls off quickly, so the surrogate works best for prioritizing which steps to verify on the GPU. GPU confirmation of a candidate discovery is still needed.
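
One way to picture the comparison against the random baseline: take the parent-child pairs the surrogate scores highest and count how many are true large improvements. This sketch assumes a top-k precision and a base-rate baseline; the function names and the exact evaluation protocol are illustrative, not taken from the paper.

```python
def precision_at_k(scores, improvements, threshold, k):
    """Fraction of the k highest-scored pairs whose true improvement >= threshold."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = sum(1 for i in order if improvements[i] >= threshold)
    return hits / k

def random_baseline(improvements, threshold):
    """Expected precision when pairs are picked at random: the base rate."""
    return sum(1 for x in improvements if x >= threshold) / len(improvements)

scores = [0.9, 0.8, 0.2, 0.1]
improvements = [2.0, 1.6, 1.1, 1.0]  # child speedup over parent
print(precision_at_k(scores, improvements, threshold=1.5, k=2))  # 1.0
print(random_baseline(improvements, threshold=1.5))              # 0.5
```

A surrogate is useful for prioritization whenever the first number stays above the second, even if it is far from 1.0.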

Zaid Khan@codezakh

Does calibration training help only confidence, or also ranking? It also improves the ranking the surrogate produces, which is what a budgeted search relies on.

Measuring the speedup found at each GPU budget, training with a Brier reward beats the base across most budgets, while correctness-only and CRPS training do not.

CRPS accounts for how far off a predicted bin is and Brier does not, so CRPS might be expected to rank better. In our experiments Brier ranks better.
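
To make the difference concrete, here is a sketch of both scores for a forecast over ordered bins. It uses the standard multi-class Brier score and the ranked probability score, which is the discrete form of CRPS; whether the paper's rewards use exactly these forms (for example, as negated scores) is an assumption.

```python
def brier_score(probs, true_bin):
    """Sum of squared differences from the one-hot outcome. Lower is better."""
    return sum((p - (1.0 if i == true_bin else 0.0)) ** 2
               for i, p in enumerate(probs))

def ranked_probability_score(probs, true_bin):
    """Discrete CRPS: squared gaps between cumulative forecast and outcome."""
    score, cumulative = 0.0, 0.0
    for i, p in enumerate(probs):
        cumulative += p
        outcome_cdf = 1.0 if i >= true_bin else 0.0
        score += (cumulative - outcome_cdf) ** 2
    return score

near_miss = [0.0, 1.0, 0.0, 0.0]  # all mass one bin away from the true bin 0
far_miss = [0.0, 0.0, 0.0, 1.0]   # all mass three bins away
print(brier_score(near_miss, 0), brier_score(far_miss, 0))                            # 2.0 2.0
print(ranked_probability_score(near_miss, 0), ranked_probability_score(far_miss, 0))  # 1.0 3.0
```

Brier penalizes both misses equally, while the ranked probability score grows with the distance between the predicted and true bins.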

Zaid Khan@codezakh

Does any of this help a real kernel search? We run a search with and without the surrogate on six tasks, holding the GPU-measurement budget per step equal for both.

Without the surrogate, the search measures every candidate it proposes. With the surrogate, it proposes four times as many candidates per step and measures only the quarter the surrogate ranks highest.

On four of the six tasks (TriMul, FP8 quantization, GDN ChunkFwd-o, GDN Recompute W/U), the surrogate search matches or beats the baseline's best kernel, often reaching it with far fewer measurements. On the other two (Cross-Entropy and GDN ChunkFwd-h) the baseline wins by 5% and 7%, and both are cases where the search saturates within its first few steps.
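
The two search variants can be sketched as a single step each. The callables `propose`, `score`, and `measure` are hypothetical stand-ins for the kernel generator, the surrogate, and the GPU benchmark; only the 4x oversampling and the equal measurement budget come from the description above.

```python
def baseline_step(propose, measure, budget):
    """Propose `budget` candidates and measure every one of them."""
    return [(c, measure(c)) for c in (propose() for _ in range(budget))]

def surrogate_step(propose, score, measure, budget, oversample=4):
    """Propose `budget * oversample` candidates, measure only the top-ranked `budget`."""
    pool = [propose() for _ in range(budget * oversample)]
    shortlist = sorted(pool, key=score, reverse=True)[:budget]
    return [(c, measure(c)) for c in shortlist]
```

Both functions call `measure` exactly `budget` times per step, so any gain from the surrogate comes from choosing which candidates reach the GPU, not from extra hardware time.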

Zaid Khan@codezakh

Work done with @cyjustinchen @jmin__cho @EliasEskin @mohitban47 @unccs @UTCompSci @JHUCompSci! We’d also like to thank @Modal for a generous academic compute grant!

We view this as a first step towards developing world models for complex cyber-physical systems!

Paper: https://arxiv.org/abs/2605.31464
Code: http://github.com/codezakh/gpu-forecasters
HuggingFace Data: https://huggingface.co/collections/codezakh/gpu-forecasters

web3nomad.eth | atypica.ai@web3nomad

@codezakh the "selective" part is what makes this interesting. not "LLM replaces GPU" but "LLM decides when it's confident enough to skip the GPU". that calibration problem is way harder than the prediction problem itself

Zengyi Qin@qinzytech

@_akhaliq interesting, this could save a lot of time on gpu tuning if it works well.
