New paper out! With @ElliottThornley:
We explore the case for training AIs to be risk-averse in resources.
The idea is that risk aversion acts as a failsafe against misalignment.
If AIs turn out misaligned but risk-averse, then, at least until they are far beyond human capability, we can pay them to cooperate with us - e.g. to reveal their misalignment, or to do useful work - because they prefer being paid to cooperate over a low chance of successful takeover.
What’s more, we think there are some reasons to expect that training for risk aversion may be easier than achieving full alignment.
A new report argues that training AIs to be risk-averse – to treat resources as having diminishing marginal utility – could both preserve AIs’ usefulness (if they turn out aligned) and provide an extra line of defense (if they turn out misaligned).
The authors sketch out some possible methods of training AIs to be risk-averse, and give reasons to be cautiously optimistic about these methods’ success.
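To make the diminishing-marginal-utility idea concrete, here is a toy sketch, not taken from the report. All numbers (the payment size, the takeover payoff, the 5% success chance) and the choice of a logarithmic utility function are hypothetical assumptions picked purely for illustration.

```python
import math

def expected_utility(lottery, utility):
    """Expected utility of a lottery given as (probability, resources) pairs."""
    return sum(p * utility(x) for p, x in lottery)

def risk_neutral(x):
    # Linear utility: each extra unit of resources is worth the same.
    return x

def risk_averse(x):
    # Concave utility (an assumed form): diminishing marginal utility in resources.
    return math.log1p(x)

# Hypothetical options facing a misaligned AI.
payment = [(1.0, 100.0)]                       # certain payment for cooperating
takeover = [(0.05, 1_000_000.0), (0.95, 0.0)]  # low chance of a huge payoff

def prefers_payment(utility):
    """True if the certain payment has higher expected utility than the gamble."""
    return expected_utility(payment, utility) > expected_utility(takeover, utility)

print("risk-neutral agent prefers payment:", prefers_payment(risk_neutral))
print("risk-averse agent prefers payment:", prefers_payment(risk_averse))
```

Under these assumed numbers, the risk-neutral agent favors the takeover gamble while the risk-averse agent favors the certain payment, which is the pattern the argument relies on.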
Read it here: https://www.forethought.org/research/risk-averse-ais



