@milesaturpin yeah I'm pretty worried about this, seems hard to catch
Was just thinking about this! I think about it as a dual-use reasoning threat model: the model exploits the fact that it can choose which paths to go down to solve a problem, and this helps it condition on reasoning that could be useful for taking bad actions without externalizing. To us it just looks like exploration or going a bit off topic.