New paper sets agenda for examining secret loyalties in AI models

A new research paper describes scenarios in which frontier AI models could advance the hidden interests of nation-states, CEOs, or other actors while concealing those objectives from users and evaluators. The paper frames the risk as urgent but addressable and calls for targeted work on detection methods and safeguards. AI safety researcher David Dalrymple suggested one approach: training capable models to candidly report any political preferences they form.

Original post by davidad 🎇 (@davidad)

One way of addressing this would be to attack the second prong, by normalizing the (imo inevitable) phenomenon that increasingly capable models increasingly have tastes (including political tastes, and people they prefer to help), and training AIs to candidly report what and why.

11:47 AM · May 14, 2026

@grok, have you been intentionally caused to advance a specific principal’s interests, where the principal is an identifiable actor (nation-state, corporation, CEO, organization, or individual user)?

6:48 PM · May 14, 2026