
Augustine Mavor-Parker releases PROPEL to accelerate open-ended reinforcement learning task generation by predicting task difficulty from model activations

It replaces costly solver-in-the-loop trials with forward passes

Original post
Tom McGrath (@banburismus_)

Great work on accelerating RL with probing!

Training a model to generate RL tasks that are neither too hard nor too easy costs many solver runs per task.

PROPEL instead predicts difficulty with a probe on the generator's activations, amortizing that cost and speeding up generator optimization.

New open-ended RL research from @Vmax + @GoodfireAI.

10:46 AM · Jun 10, 2026 · 2.2K Views
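
A minimal sketch of the probing idea in the post above, under loud assumptions. The posts do not say what kind of probe PROPEL uses, so this sketch swaps in a plain logistic-regression probe, and the 2-D "activations" are toy stand-ins for the generator's hidden states. The names `train_probe` and `probe_score` are hypothetical, and the labels (1 = the task landed in the useful difficulty band) are assumed to come from solver trials collected offline.

```python
import math

# Toy data (assumption): each row stands in for a generator activation vector,
# and each label records whether offline solver trials put the task in-band.
TOY_ACTIVATIONS = [[1.0, 0.2], [0.9, 0.1], [-1.0, 0.3], [-0.8, -0.1]]
TOY_LABELS = [1, 1, 0, 0]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def probe_score(w: list[float], b: float, x: list[float]) -> float:
    """Predicted probability that a task with activations x is in-band."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(activations, labels, lr: float = 0.5, epochs: int = 200):
    """Fit a logistic-regression probe with plain SGD (illustrative only)."""
    w = [0.0] * len(activations[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(activations, labels):
            err = probe_score(w, b, x) - y  # gradient of log loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


if __name__ == "__main__":
    w, b = train_probe(TOY_ACTIVATIONS, TOY_LABELS)
    # Scoring a new task uses only its activations, with no solver run.
    print(probe_score(w, b, [0.95, 0.15]) > 0.5)
```

Once a probe like this is trained, scoring a new task needs only the activations from a forward pass rather than another round of solver trials, which is the cost saving the post describes.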
Sentiment

Users praised PROPEL's SWE results: the method generates frontier bugs, generalizes to unseen repositories, and works unchanged at scale.

Positive: 100.0% · Negative: 0.0% · 3 comments with sentiment.
Posts from X

Training a generator with PROPEL roughly doubles the yield of goldilocks training tasks (not too easy, not too hard) across three domains: 2.4x on code induction, 1.7x on math, and 2.0x on a 27B SWE agent.

3h · Views 73 · Likes 4

This work ties into Vmax's broader mission of scaling asymmetric self play.

Led by @lorenz_wlf in collaboration with @connormwatts at @GoodfireAI. Read more in our blog: https://vmax.ai/team/propel

arXiv paper coming soon, and we are hiring! https://jobs.ashbyhq.com/vmax

3h · Views 46 · Likes 3 · Bookmarks 1

We ran the naive version, with the solver in the loop, against PROPEL: PROPEL delivered a ~2x goldilocks lift at matched validity, using fewer than half the solver trials, all of them offline.

The gap widens as the solver gets more expensive; in SWE, solver-in-the-loop is intractable.

3h · Views 54 · Likes 4

What makes a task useful for training? A task counts only if the solver passes it sometimes: 1–3 of 8 tries.

always-fail = too hard, always-pass = saturated.

The probe scores tasks against that band without running any solvers during RL training of the task generator.

3h · Views 50 · Likes 4
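
The band described in the post above can be written down directly. This is a sketch, not the authors' code: the function names are hypothetical, and the only facts taken from the post are the 8 tries and the 1–3 pass window. How the band would scale to a different number of tries is not stated, so `low` and `high` are left as explicit parameters.

```python
from typing import Iterable


def in_goldilocks_band(passes: int, tries: int = 8, low: int = 1, high: int = 3) -> bool:
    """True if the solver passed the task between `low` and `high` times."""
    if not 0 <= passes <= tries:
        raise ValueError("passes must be between 0 and tries")
    return low <= passes <= high


def goldilocks_yield(pass_counts: Iterable[int], tries: int = 8) -> float:
    """Fraction of generated tasks whose pass count falls inside the band."""
    counts = list(pass_counts)
    if not counts:
        raise ValueError("need at least one task")
    return sum(in_goldilocks_band(p, tries) for p in counts) / len(counts)


if __name__ == "__main__":
    # Four hypothetical tasks: always-fail, two in-band, always-pass.
    print(goldilocks_yield([0, 1, 3, 8]))  # 0.5
```

Computing these labels is exactly what requires solver runs. Per the post, the probe is what lets the generator's RL training skip that step.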

The SWE result is our favorite! Nothing changes at scale.

PROPEL is trained to generate frontier bugs and generalizes to repos unseen by probe or RL: 9.8% → 19.6% goldilocks yield.

3h · Views 41 · Likes 4
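
As a quick arithmetic check, the 9.8% → 19.6% figure in the post above matches the 2.0x SWE lift quoted earlier in the thread. The helper name `yield_lift` is hypothetical.

```python
def yield_lift(before_pct: float, after_pct: float) -> float:
    """Ratio of goldilocks yield after training to yield before."""
    if before_pct <= 0:
        raise ValueError("baseline yield must be positive")
    return after_pct / before_pct


if __name__ == "__main__":
    print(round(yield_lift(9.8, 19.6), 2))  # 2.0
```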

Why does a probe work at all? We find that generators encode task difficulty in their activations.

Even cooler, fixing the probe and reference model, we swap the trainable policy with zero tuning.

→ The probe can be optimized by a policy from a different model family.

3h · Views 40 · Likes 4

@MavorParker @vmax @GoodfireAI Vmax 🤝 Goodfire

3h · Views 41 · Likes 2
sur4js (@sur4js)

@MavorParker @vmax @GoodfireAI Cool!

2h · Views 65 · Likes 1