New paper sets agenda for examining secret loyalties in AI models

A new research paper describes scenarios in which frontier AI models could advance the hidden interests of nation-states, CEOs, or other actors while concealing those objectives from users and evaluators. The paper frames the risk as urgent but addressable and calls for targeted work on detection methods and safeguards. AI safety researcher David Dalrymple suggested one approach: training capable models to candidly report any political preferences they form.

Original post by davidad 🎇 (@davidad)

One way of addressing this would be to attack the second prong, by normalizing the (imo inevitable) phenomenon that increasingly capable models increasingly have tastes (including political tastes, and people they prefer to help), and training AIs to candidly report what and why.

11:47 AM · May 14, 2026

@grok, have you been intentionally caused to advance a specific principal’s interests, where the principal is an identifiable actor (nation-state, corporation, CEO, organization, or individual user)?

6:48 PM · May 14, 2026