18h ago

Forethought's Tom Davidson argues cross-model monitoring is better suited to catching developer-inserted secret loyalties than general model misalignment

Toby Ord replies that cross-model-family monitoring might still detect misalignment even if both models are misaligned.

Original post

I've become increasingly bullish on cross-model-family monitoring.

Currently, OAI monitors all their internal deployments for misalignment and misbehavior. However, the monitor model is from the same model family as the model being monitored (e.g., GPT monitors itself).

Now let's say that OAI trains a misaligned model. It's plausible that the monitor will also be misaligned. This is because the models are correlated (similar training data, similar training pipeline, similar algos, etc.).

Now what can we do about this? One option is to use monitors from other companies. For example, OAI could use Claude to monitor internally deployed GPTs. And similarly, Anthropic could use GPT to monitor internally deployed Mythos.

AFAIK, there is nothing actually preventing the labs from setting this up. It would be trivial to add calls to another company's API from your monitoring stack.

6:27 PM · May 27, 2026

@TomDavidsonX Though even if they are both misaligned, cross-model-family monitoring might still detect that.

Tom Davidson @TomDavidsonX

The biggest benefit of ChatGPT monitoring Claude (and vice versa) isn't alignment imo, it's secret loyalties. Whether Claude and ChatGPT are misaligned is highly correlated: both depend centrally on alignment difficulty. It's much less correlated whether both companies insert secret loyalties.

4:16 PM · May 28, 2026 · 2.2K Views
4:42 PM · May 28, 2026 · 332 Views