Tech · 2h ago

Augustine Mavor-Parker releases PROPEL to accelerate open-ended reinforcement learning task generation by predicting task difficulty from model activations

It replaces costly solver-in-the-loop trials with forward passes

#1810

Original post

Tom McGrath@banburismus_ · #1810 in Tech

Great work on accelerating RL with probing!

Augustine Mavor-Parker@MavorParker

Training a model to generate RL tasks that are neither too hard nor too easy costs many solver runs per task.

PROPEL instead predicts difficulty with a probe on the generator's activations, amortizing that cost and speeding up generator optimization.

New open-ended RL research from @Vmax + @GoodfireAI.

10:46 AM · Jun 10, 2026 · 1.8K Views

Sentiment

Commenters praised PROPEL's SWE results: the generator produces frontier bugs and generalizes to unseen repositories, with nothing changing at scale.

Pos

100.0%

Neg

0.0%

3 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Augustine Mavor-Parker@MavorParker

Training a generator with PROPEL roughly doubles the yield of goldilocks training tasks (not too easy, not too hard) across three domains: 2.4x on code induction, 1.7x on math, and 2.0x on a 27B SWE agent.

Augustine Mavor-Parker@MavorParker

This work ties into Vmax's broader mission of scaling asymmetric self-play.

Led by @lorenz_wlf in collaboration with @connormwatts at @GoodfireAI. Read more in our blog: https://vmax.ai/team/propel

arXiv coming soon, and we are hiring! https://jobs.ashbyhq.com/vmax

Augustine Mavor-Parker@MavorParker

We ran the naive version with the solver in the loop against PROPEL: PROPEL delivered ~2x goldilocks lift at matched validity, using less than half the solver trials and all offline.

The gap widens as the solver gets more expensive; in SWE, solver-in-the-loop is intractable.

Augustine Mavor-Parker@MavorParker

What makes a task useful for training? A task counts only if the solver passes it sometimes, 1–3 of 8 tries.

always-fail = too hard, always-pass = saturated.

The probe scores tasks against that band without running any solvers during RL training of the task generator.
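
The band described above can be sketched in a few lines. The first function checks a measured pass count against the 1–3 of 8 window stated in the post. The second is purely an assumption for illustration: it supposes the probe outputs a per-attempt pass probability `p` and treats the 8 attempts as independent, so the chance of landing in the band is a binomial sum. The names `in_band` and `band_probability` are hypothetical and do not come from the PROPEL blog post.

```python
from math import comb

TRIES = 8       # solver attempts per task, as stated in the post
BAND = (1, 3)   # pass counts that qualify as "goldilocks", as stated in the post


def in_band(passes: int, tries: int = TRIES, band: tuple = BAND) -> bool:
    """Return True if a measured pass count falls inside the goldilocks band."""
    if not 0 <= passes <= tries:
        raise ValueError("passes must be between 0 and tries")
    lo, hi = band
    return lo <= passes <= hi


def band_probability(p: float, tries: int = TRIES, band: tuple = BAND) -> float:
    """Probability that K ~ Binomial(tries, p) lands inside the band.

    Assumption: p is a predicted per-attempt pass probability and
    attempts are independent. This is an illustrative scoring rule,
    not the scoring rule used by PROPEL.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    lo, hi = band
    return sum(
        comb(tries, k) * p**k * (1.0 - p) ** (tries - k)
        for k in range(lo, hi + 1)
    )


if __name__ == "__main__":
    for passes in (0, 2, 8):
        print(passes, in_band(passes))
    for p in (0.05, 0.25, 0.9):
        print(p, round(band_probability(p), 3))
```

Under this assumed scoring rule, a task the solver almost always fails or almost always passes scores near zero, while a task with a moderate predicted pass rate scores high, which mirrors the always-fail and always-pass cases in the post.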

Augustine Mavor-Parker@MavorParker

The SWE result is our favorite! Nothing changes at scale.

PROPEL is trained to generate frontier bugs and generalizes to repos unseen by probe or RL: 9.8% → 19.6% goldilocks yield.
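
As a quick check of the arithmetic, the quoted SWE yields correspond to the 2.0x figure cited for the 27B SWE agent. The helper name `goldilocks_lift` is hypothetical, used here only for illustration.

```python
def goldilocks_lift(baseline_yield: float, new_yield: float) -> float:
    """Ratio of goldilocks yield after vs. before (both as fractions of tasks)."""
    if baseline_yield <= 0:
        raise ValueError("baseline_yield must be positive")
    return new_yield / baseline_yield


# Yields quoted in the post for unseen SWE repositories: 9.8% -> 19.6%.
swe_lift = goldilocks_lift(0.098, 0.196)

if __name__ == "__main__":
    print(f"SWE goldilocks lift: {swe_lift:.1f}x")
```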

Augustine Mavor-Parker@MavorParker

Why does a probe work at all? We find that generators encode task difficulty in their activations.

Even cooler: keeping the probe and reference model fixed, we swap the trainable policy with zero tuning.

→ A policy from a different model family can be optimized against the same probe.

Jonathan Brebner@JPBrebner

@MavorParker @vmax @GoodfireAI Vmax 🤝 Goodfire

sur4js@sur4js

@MavorParker @vmax @GoodfireAI Cool!
