Zaid Khan releases GPU Forecasters to predict CUDA and Triton kernel runtimes using LLMs as hardware surrogates
The framework utilizes a 20-billion parameter open-weights model.
Users praise the use of LLMs as selective surrogates that forecast GPU kernel runtimes during optimization. They say the selective-deferral design and the calibration results make the approach practical to use directly in a pipeline and a time-saver for tuning.

paper: https://huggingface.co/papers/2605.31464
Can an LLM act as a selective model of a GPU during evolutionary search, by reasoning + forecasting a kernel’s runtime but deferring to a GPU when unsure? We produced 12k kernels + runtimes from evolutionary search, costing 400M reasoning tokens + 600 GPU-hours to answer this.
In our work GPU Forecasters, we study language models as selective surrogates for GPU kernel optimization.
1️⃣ Off-the-shelf LLMs can forecast how a GPU responds to a candidate kernel with non-trivial accuracy. If we rank candidates by these predictions and measure only the top 10% on a GPU, the fastest kernel we find is within 20% of the best in the pool.
2️⃣ We want LLMs to not just be accurate but also calibrated, so that we can use their uncertainty for selective prediction: during search, we should trust only confident forecasts and verify less confident forecasts by sending them to the GPU.
3️⃣ We train an open-weights surrogate (GPT-OSS-20B) with RL to improve both accuracy and calibration. Calibration-shaped rewards improve both confidence reliability and ranking ability, while correctness rewards alone do not.
4️⃣ Inside a real kernel search, the surrogate finds faster kernels than an equal-GPU-budget baseline by considering more candidates per measurement.
5️⃣ We release 12,388 LLM-generated GPU kernels with measured runtimes spanning 118 operations, CUDA and Triton backends, 3 GPU types, taking 400M tokens + 600 GPU-hours to produce. This dataset can be used for analyzing LLM-driven evolutionary program search dynamics, post-training LLMs for kernel code generation, and things we didn’t get a chance to explore, like training reward models!
Thread 🧵👇
Can LLMs predict GPU kernel runtimes instead of measuring them on actual hardware?
We find that:
- LLMs act as great selective surrogates (deferring to GPUs when unsure)
- RL improves LLM accuracy & calibration
- Kernel search becomes much more efficient
We're releasing 12K kernels + runtimes for the community to build on.
Great work led by Zaid! Check more details 🧵
Appreciate the shoutout @_akhaliq for our work on "GPU Forecasters" exploring whether language models can act as selective surrogates for GPU kernel optimization! Details in our thread:
GPU Forecasters
Language Models as Selective Surrogates for Kernel Runtime Optimization
new linter just dropped
🚨 GPU Forecasters 👉 we explore if a reasoning model can be a selective world model of a GPU, forecasting a kernel's speed while deferring to real hardware when unsure, making kernel search more efficient.
Inside an evolutionary kernel search, the surrogate lets us explore many more candidates in imagination and run only the most promising on the GPU. We often find kernels as fast or faster using the same number of real GPU evaluations.
We also show that reinforcement learning with calibration rewards can teach the surrogate to know when it doesn't know, making it more reliable during search.
We see this as early work toward approximate world models of complex hardware-software systems!
🧵 👇
🚨LLMs are increasingly used to generate GPU kernels, but evaluating those kernels still requires expensive compilation and execution on real hardware.
Can LLMs act not just as kernel generators, but also as forecasters of kernel performance that defer to hardware only when uncertain?
Introducing ✨GPU Forecasters✨, our new study of LLMs as selective surrogates for GPU kernel optimization across:
• 12,388 measured kernels across 118 operations
• CUDA + Triton backends & 3 GPU types
• 400M tokens + 600 GPU-hours
We find that:
1⃣ Off-the-shelf LLMs can predict relative kernel performance surprisingly well. Measuring only the top 10% of LLM-ranked candidates recovers kernels within 20% of the best available.
2⃣ Accuracy alone isn't enough. A useful surrogate must be calibrated, i.e., knowing when to trust its forecasts and when to defer to the GPU.
3⃣ Inside a real evolutionary kernel search, the surrogate evaluates far more candidates under the same GPU budget, leading to faster kernels than an equal-budget baseline.
More results, analysis, and released data in the thread 🧵👇
GPU kernels are the engines powering NNs, making their optimization a key to self-improving agents. But search over kernels is expensive because eval on hardware takes time.
We train calibrated surrogate models that forecast kernel speedups w/out execution. Calibration is key here as it lets us perform selective prediction, off-loading uncertain predictions to the GPU while trusting more certain ones.
We see this as a first step towards building world models for hardware-software systems!
Key findings:
▪️ We find that off-the-shelf models can perform forecasting and we show how we can use calibration losses to improve them
▪️ We also show how our selective surrogate models can be incorporated into real kernel searches, leading search to converge on faster kernels under the same budget and breaking out of stagnant searches
▪️ Along the way, we built up a sizeable dataset of >12k generated kernels with their runtimes. This is an important resource for future work in this area, and opens up a lot of interesting research directions in predicting kernel performance.
Check out the 🧵 and paper for more details! 👇

Predicting an exact runtime from source alone is intractable. We instead predict how fast a candidate is relative to a reference kernel, sorted into eight logarithmically spaced bins. Logarithmic rather than linear bins make the target easier to predict, since the model only has to judge whether a kernel is much slower, near baseline, or much faster, which is easier than separating a 2x speedup from a 3x one.
Given the reference kernel, the candidate, and the target hardware, the model reasons about how the candidate will execute and returns a probability distribution over those bins. Its confidence is the probability it puts on the bin it predicts, and a search can defer the low-confidence cases to the GPU.
We then use RL to improve this distribution, rewarding the model both for predicting the correct bin and for reporting calibrated probabilities across all of them.
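To make the binning and deferral concrete, here is a minimal Python sketch. The thread says only that there are eight logarithmically spaced bins, so the bin range (1/16x to 16x), the 0.6 confidence threshold, and the names `speedup_to_bin` and `forecast_or_defer` are assumptions for illustration, not the paper's actual values or code.

```python
import math

# Assumed layout: eight bins, each one power of two wide, covering 1/16x..16x.
# The source does not give the real bin edges.
NUM_BINS = 8
LOG2_MIN, LOG2_MAX = -4.0, 4.0


def speedup_to_bin(speedup: float) -> int:
    """Map a speedup (reference_time / candidate_time) to a bin index 0..7."""
    width = (LOG2_MAX - LOG2_MIN) / NUM_BINS
    idx = math.floor((math.log2(speedup) - LOG2_MIN) / width)
    # Clamp speedups outside the assumed range into the edge bins.
    return max(0, min(NUM_BINS - 1, idx))


def forecast_or_defer(probs, threshold=0.6):
    """Return (predicted_bin, confidence, defer) for one forecast.

    Confidence is the probability on the predicted (most likely) bin;
    defer is True when the forecast should be verified on the GPU.
    """
    predicted = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[predicted]
    return predicted, confidence, confidence < threshold
```

With these assumed edges a 2x and a 3x speedup fall into the same bin, which matches the point that the model is not asked to separate them.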

How well do off-the-shelf LLMs do at this, with no training? We rank candidates by predicted speedup and check how fast the best kernel in the top-ranked fraction is, relative to the fastest kernel in the whole pool.
For GPT-OSS-20B, measuring just the top 1% of the ranking finds a kernel within 30% of the fastest in the pool, and the top 50% gets within 6%. Gemini-3 Flash ranks best of the models we tested.
Calibration varies much more across models. A model can rank candidates well while reporting confidence that does not match its accuracy, and we use RL to improve this.
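The top-fraction evaluation described above can be sketched roughly as follows. This is an illustrative reading, not the authors' evaluation code: the function name `best_in_top_fraction` and the choice to report the gap as `best_measured / best_in_pool - 1` are assumptions.

```python
import math


def best_in_top_fraction(predicted_speedup, measured_runtime, fraction):
    """Gap between the best kernel in the top-ranked fraction and the pool's best.

    Candidates are ranked by predicted speedup (highest first). Only the top
    `fraction` would be measured on a GPU. Returns 0.0 if that subset contains
    the fastest kernel in the pool, 0.30 if its best kernel is 30% slower.
    """
    n = len(predicted_speedup)
    k = max(1, math.ceil(fraction * n))  # always measure at least one candidate
    ranked = sorted(range(n), key=lambda i: predicted_speedup[i], reverse=True)
    best_measured = min(measured_runtime[i] for i in ranked[:k])
    best_in_pool = min(measured_runtime)
    return best_measured / best_in_pool - 1.0
```

A perfect ranker returns 0.0 at every fraction, and any ranker returns 0.0 once the fraction reaches 1.0.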

How does RL change the model's predictions? The untrained GPT-OSS-20B tends to call slow kernels faster than they are and the very fastest kernels slower than they are.
RL redistributes the model's probability mass across the speedup bins. The confusion matrices show the base model's errors and how each reward (correctness or calibration) moves them. A perfect predictor would be dark only on the diagonal (green squares), meaning its predicted speedup always equals the true speedup.
There is a tradeoff. Training improves calibration but increases raw forecast error, and the reward chosen sets the balance between them. Essentially, the surrogate becomes less confident (as it should) after RL training for calibration.

Where does a surrogate's training data come from? It is a byproduct of running search. Every measured candidate already carries the (reference, candidate, hardware, speedup) tuple a surrogate learns from, so a long-running search produces its own training set.
We release 12,388 LLM-generated GPU kernels with measured runtimes, spanning 118 problems, CUDA and Triton, three GPU types, and four search methods, at a cost of 400M tokens and 600 GPU-hours. Because kernel search is computationally expensive, this dataset can be re-used for analyzing LLM-driven evolutionary program search dynamics, post-training LLMs for kernel code generation, and things we didn’t get a chance to explore, like training reward models!

Can we train a surrogate to know what it doesn’t know? For reliable use, a surrogate should be uncertain about the runtime of candidates it understands poorly and confident about those it understands well. We test this by grouping predictions by stated confidence and plotting forecast error within each group.
An ideal plot would be monotonically decreasing from left to right, meaning as the confidence goes up, the forecast error always goes down. Any deviation from monotonicity (if the line goes up, then down) means that the surrogate’s confidence was misaligned.
The off-the-shelf model almost has this property. Training on correctness alone removes this property, making the model confident on candidates it gets badly wrong, and removing a clear relationship between accuracy and confidence.
Adding a Brier calibration reward to the same correctness reward restores the property and improves on the base, so higher confidence reliably means lower error.
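As a rough illustration of why a Brier term discourages confident mistakes, the sketch below computes the standard multi-class Brier score and subtracts it from a 0/1 correctness reward. The way the two terms are combined, the `brier_weight` parameter, and the function names are assumptions; the thread does not give the exact reward used in training.

```python
def brier_score(probs, true_bin):
    """Multi-class Brier score: squared distance to the one-hot truth (lower is better)."""
    return sum(
        (p - (1.0 if i == true_bin else 0.0)) ** 2 for i, p in enumerate(probs)
    )


def reward(probs, true_bin, brier_weight=1.0):
    """Hypothetical combined reward: correctness minus a weighted Brier penalty."""
    predicted = max(range(len(probs)), key=probs.__getitem__)
    correctness = 1.0 if predicted == true_bin else 0.0
    return correctness - brier_weight * brier_score(probs, true_bin)
```

Under this sketch, a wrong forecast made with 90% confidence is penalized more than the same wrong forecast made with 55% confidence, whereas a correctness-only reward scores both as zero.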

Can the surrogate spot the rare mutations that make a kernel much faster than its parent? We call these discovery moments, the steps that break stagnation in a search.
We mine 1,347 parent-child pairs and use the surrogate to predict the large improvements. Its scores are predictive, staying above the random baseline at every threshold and reaching higher precision for larger improvements.
Precision still falls off quickly, so the surrogate works best for prioritizing which steps to verify on the GPU. GPU confirmation of a candidate discovery is still needed.

Does calibration training help only confidence, or also the ranking? It improves the ranking too, and ranking quality is what a budgeted search depends on.
Measuring the speedup found at each GPU budget, training with a Brier reward beats the base across most budgets, while correctness-only and CRPS training do not.
CRPS accounts for how far off a predicted bin is and Brier does not, so CRPS might be expected to rank better. In our experiments Brier ranks better.
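The difference between the two scores can be seen in a small sketch. Over discrete bins, CRPS reduces to the ranked probability score, which compares cumulative distributions. The code below uses that standard discrete form as an assumption, since the thread does not spell out the exact formulation used.

```python
from itertools import accumulate

NUM_BINS = 8


def one_hot(k, n=NUM_BINS):
    """Forecast that puts all probability on bin k."""
    return [1.0 if i == k else 0.0 for i in range(n)]


def brier(probs, true_bin):
    """Squared distance to the one-hot truth; ignores how far away the mass is."""
    return sum((p - t) ** 2 for p, t in zip(probs, one_hot(true_bin, len(probs))))


def ranked_probability_score(probs, true_bin):
    """Discrete CRPS: squared distance between predicted and true CDFs."""
    cdf = accumulate(probs)
    return sum(
        (c - (1.0 if i >= true_bin else 0.0)) ** 2 for i, c in enumerate(cdf)
    )


near_miss = one_hot(1)  # true bin is 0, forecast is one bin off
far_miss = one_hot(7)   # true bin is 0, forecast is seven bins off
```

Here `brier` gives both misses the same score, while `ranked_probability_score` penalizes the far miss seven times as much as the near one.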

Does any of this help a real kernel search? We run a search with and without the surrogate on six tasks, holding the GPU-measurement budget per step equal for both.
Without the surrogate, the search measures every candidate it proposes. With the surrogate, it proposes four times as many candidates per step and measures only the quarter the surrogate ranks highest.
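One search step under each setting might look like the sketch below. The callables `propose`, `score`, and `measure` are hypothetical stand-ins for the kernel generator, the surrogate's predicted speedup, and a real GPU measurement; none of these names come from the released code.

```python
def baseline_step(propose, measure, budget):
    """Baseline: propose `budget` candidates and measure every one on the GPU."""
    return [(c, measure(c)) for c in (propose() for _ in range(budget))]


def surrogate_step(propose, score, measure, budget, oversample=4):
    """Surrogate: propose `oversample` times as many candidates, then measure
    only the `budget` candidates the surrogate ranks highest."""
    candidates = [propose() for _ in range(budget * oversample)]
    ranked = sorted(candidates, key=score, reverse=True)
    return [(c, measure(c)) for c in ranked[:budget]]
```

Both functions call `measure` exactly `budget` times, so the GPU cost per step is equal and any gain comes from which candidates get measured.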
On four of the six tasks (TriMul, FP8 quantization, GDN ChunkFwd-o, GDN Recompute W/U), the surrogate search matches or beats the baseline's best kernel, often reaching it with far fewer measurements. On the other two (Cross-Entropy and GDN ChunkFwd-h), the baseline wins by 5% and 7%, and both are cases where the search saturates within its first few steps.

Work done with @cyjustinchen @jmin__cho @EliasEskin @mohitban47 @unccs @UTCompSci @JHUCompSci! We’d also like to thank @Modal for a generous academic compute grant!
We view this as a first step towards developing world models for complex cyber-physical systems!
Paper: https://arxiv.org/abs/2605.31464
Code: http://github.com/codezakh/gpu-forecasters
HuggingFace Data: https://huggingface.co/collections/codezakh/gpu-forecasters

@codezakh the "selective" part is what makes this interesting. not "LLM replaces GPU" but "LLM decides when it's confident enough to skip the GPU". that calibration problem is way harder than the prediction problem itself

@_akhaliq interesting, this could save a lot of time on gpu tuning if it works well.