
Will MacAskill and Elliott Thornley propose training AIs to be risk-averse about resources as a failsafe against misalignment

Structured resource payments incentivize models to cooperate with humans.


Original post

William MacAskill @willmacaskill

New paper out! With @ElliottThornley:

We explore the case for training AIs to be risk-averse in resources.

The idea is that risk aversion acts as a failsafe against misalignment.

If AIs turn out misaligned but risk-averse, then, at least before the point at which they are far beyond human capability, we can pay them to cooperate with us - e.g. to reveal misalignment, or to do useful work - because they prefer the payment-and-cooperation to a low chance of successful takeover.

What’s more, we think there are some reasons for thinking training for risk-aversion may be easier than getting full alignment.

Forethought @forethought_org

A new report argues that training AIs to be risk-averse – to treat resources as having diminishing marginal utility – could both preserve AIs’ usefulness (if they turn out aligned) and provide an extra line of defense (if they turn out misaligned).

The authors sketch out some possible methods of training AIs to be risk-averse, and give reasons to be cautiously optimistic about these methods’ success.

Read it here: https://www.forethought.org/research/risk-averse-ais

5:34 AM · Jun 25, 2026 · 6.3K Views
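The core claim in the posts above (that a risk-averse AI would prefer a certain payment to a low chance of successful takeover) can be illustrated with a small expected-utility calculation. This is only a sketch under stated assumptions: the payment, the takeover prize, the success probability, and the choice of a logarithmic utility function are all hypothetical numbers and modelling choices made for illustration, not figures or methods taken from the paper.

```python
import math

def expected_utility(lottery, utility):
    """Expected utility of a lottery given as (probability, resources) pairs."""
    return sum(p * utility(r) for p, r in lottery)

# Hypothetical values, chosen only to illustrate the argument.
PAYMENT = 100.0                # resources paid, with certainty, for cooperating
TAKEOVER_PRIZE = 1_000_000.0   # resources gained if a takeover attempt succeeds
P_SUCCESS = 0.01               # low chance that the attempt succeeds

cooperate = [(1.0, PAYMENT)]
attempt_takeover = [(P_SUCCESS, TAKEOVER_PRIZE), (1.0 - P_SUCCESS, 0.0)]

def risk_neutral(resources):
    # Linear utility: every extra unit of resources is worth the same.
    return resources

def risk_averse(resources):
    # Concave utility: diminishing marginal utility of resources.
    # log1p is one assumed choice; any concave function shows the same effect.
    return math.log1p(resources)

risk_neutral_prefers_takeover = (
    expected_utility(attempt_takeover, risk_neutral)
    > expected_utility(cooperate, risk_neutral)
)
risk_averse_prefers_cooperation = (
    expected_utility(cooperate, risk_averse)
    > expected_utility(attempt_takeover, risk_averse)
)

print("Risk-neutral agent prefers takeover:", risk_neutral_prefers_takeover)
print("Risk-averse agent prefers cooperation:", risk_averse_prefers_cooperation)
```

With these assumed numbers, the gamble has the higher expected resources, so a risk-neutral agent would take it, while the agent with diminishing marginal utility prefers the certain payment.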

Sentiment

Commenters broadly accept the paper's argument that training AIs for risk aversion may be an easier failsafe against misalignment than full alignment, and one thanks the authors for clarifying the point.

Positive: 100.0%. Negative: 0.0%. Based on 2 comments with sentiment.


Related links

Risk-averse AIs (forethought.org)

Posts from X


GOON MASTER SOPHONT SIMP @SOPHONTSIMP

@willmacaskill @ElliottThornley Couldn’t one argue that you would need to solve alignment to make AIs risk averse in this way


niplav is @niplav_site

@SOPHONTSIMP @willmacaskill @ElliottThornley Risk averse utility functions might have higher inductive bias from neural networks, as opposed to something like [full human values]


Elliott Thornley @ElliottThornley

@niplav_site @SOPHONTSIMP @willmacaskill Yep and more broadly we argue that training AIs to be risk-averse is likely easier than full alignment in section 10: https://www.forethought.org/research/risk-averse-ais#10-why-think-that-we-can-make-ais-risk-averse


GOON MASTER SOPHONT SIMP @SOPHONTSIMP

@ElliottThornley @niplav_site @willmacaskill I see, thanks for clarifying


Sharmake Farah @SharmakeFarah14

@SOPHONTSIMP @willmacaskill @ElliottThornley This isn't true, because we don't need the AIs to have acceptable goals by our lights, we only need the AI goals to be bounded above and risk averse.

Even if AIs terminally wanted paperclips or something else valueless to us, so long as it doesn't want to take over, we are fine.
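The "bounded above" condition in this reply can be sketched the same way. If utility never exceeds some ceiling, the expected utility of a takeover attempt cannot exceed the success probability times that ceiling, however large the prize. The utility function, the `scale` parameter, and all numbers below are hypothetical choices for illustration, and the sketch assumes a failed attempt leaves the AI with zero resources.

```python
import math

def bounded_utility(resources, scale=100.0):
    """Concave utility that rises from 0 toward a ceiling of 1 (assumed form)."""
    return 1.0 - math.exp(-resources / scale)

def prefers_cooperation(payment, prize, p_success, scale=100.0):
    """True if a certain payment beats a gamble on takeover.

    Assumes a failed takeover attempt yields zero resources (utility 0).
    """
    eu_cooperate = bounded_utility(payment, scale)
    eu_takeover = p_success * bounded_utility(prize, scale)
    return eu_cooperate > eu_takeover

# With a 20% success chance, making the prize larger never flips the choice,
# because the takeover's expected utility is capped at 0.2.
results = [prefers_cooperation(50.0, prize, 0.2) for prize in (1e3, 1e6, 1e12)]
print(results)

# A high enough success chance does flip it, which is why the argument is
# restricted to AIs with a low chance of successful takeover.
print(prefers_cooperation(50.0, 1e6, 0.5))
```

Under these assumptions the size of the prize stops mattering once utility is near its ceiling. What matters is whether the success probability is below the utility of the certain payment.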
