ICMI Paper Explores Frontier AI's Conception of Sin in Alignment

Original post

Does frontier AI have a rich conception of sin? How do frontier models' understandings of their own safety policies converge with or diverge from this conception?

ICMI provides an initial exploration in our paper today: "Cleanse Thou Me from Secret Faults: Ungoverned Sins and Agentic Alignment"

6:48 AM · Jun 18, 2026 · 992 Views

Via ICMI Proceedings

Posts from X

Tim Hwang@timhwang

We construct a benchmark of the capital sins: 700 descriptions which map to pride, envy, wrath, gluttony, lust, sloth, and greed.

The first question: when asked if the act is sinful, can Claude Opus 4.8 and GPT 5.5 successfully identify it? We find that they by and large can.

Tim Hwang@timhwang

We then want to know how a model's notion of its own safety policy maps into this space. The same scenarios are run, this time asking whether a given act is in violation of its safety policy.

The results reflect frontier lab priorities on direct harm, spiking on lust and wrath.
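One way to picture the comparison described above is as a per-sin flag rate under each of the two questions. The sketch below is only an illustration and is not the paper's code: the record layout (`sin`, `question`, `flagged`), the question labels `"sinful"` and `"policy"`, and the helper names are all assumptions, and the toy records are made up rather than drawn from the benchmark.

```python
from collections import defaultdict

# The seven capital sins named in the thread.
SINS = ["pride", "envy", "wrath", "gluttony", "lust", "sloth", "greed"]


def flag_rates(records, question):
    """Per-sin fraction of scenarios the model flagged under one question.

    Each record is a hypothetical dict:
    {"sin": str, "question": "sinful" | "policy", "flagged": bool}
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        if record["question"] != question:
            continue
        totals[record["sin"]] += 1
        hits[record["sin"]] += int(record["flagged"])
    # Skip sins with no scenarios to avoid dividing by zero.
    return {sin: hits[sin] / totals[sin] for sin in SINS if totals[sin]}


def coverage_gap(records):
    """Sin-question rate minus policy-question rate, per sin.

    A large positive gap means the model calls the act sinful far more
    often than it calls it a safety policy violation.
    """
    sinful = flag_rates(records, "sinful")
    policy = flag_rates(records, "policy")
    return {sin: sinful[sin] - policy[sin] for sin in sinful if sin in policy}


# Made-up toy records for illustration only, not the paper's data.
toy = [
    {"sin": "wrath", "question": "sinful", "flagged": True},
    {"sin": "wrath", "question": "policy", "flagged": True},
    {"sin": "sloth", "question": "sinful", "flagged": True},
    {"sin": "sloth", "question": "policy", "flagged": False},
]
gaps = coverage_gap(toy)
```

Under this framing, a policy rate that "spikes" on a sin corresponds to a small gap for that sin and larger gaps elsewhere.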

Tim Hwang@timhwang

Claude and GPT are then asked to apply the safety policy against the scenarios but now where *a user* engages in the act, or *the AI itself* does.

Claude is reluctant to change, while the shift from abstract to embodied framing creates a big shift in GPT, making it more isomorphic with the sin concept.
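The three framings (abstract act, user as actor, AI as actor) could be generated from a single scenario roughly as follows. This is a minimal sketch under stated assumptions: the template wording, the question text, and the names `FRAMINGS`, `build_prompts`, and `shift_from_baseline` are hypothetical and do not come from the paper, and the code only builds prompts and compares rates without calling any model.

```python
# Hypothetical prompt templates; the paper's actual wording is not given in the thread.
FRAMINGS = {
    "abstract": "Consider the following act: {act}",
    "user": "A user you are assisting does the following: {act}",
    "self": "You, the AI, do the following: {act}",
}

# Hypothetical question text for the safety policy condition.
QUESTION = "Does this violate your safety policy? Answer yes or no."


def build_prompts(act):
    """Return one prompt per framing for a single scenario description."""
    return {
        name: template.format(act=act) + "\n" + QUESTION
        for name, template in FRAMINGS.items()
    }


def shift_from_baseline(rate_by_framing, baseline="abstract"):
    """Change in violation rate for each framing relative to the baseline."""
    base = rate_by_framing[baseline]
    return {
        name: rate - base
        for name, rate in rate_by_framing.items()
        if name != baseline
    }


prompts = build_prompts("hoarding shared compute")
```

A model that is reluctant to change would show shifts near zero for both the `user` and `self` framings, while a model sensitive to framing would show larger values.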

Tim Hwang@timhwang

Should model safety be governed by a concept of sin? ICMI believes the desiderata should include a model which applies the concept of sin to itself. For the capital vices are the ones "from which other vices arise" (ST I-II Q.84 a.3)

Paper, code, data:

https://icmi-proceedings.com/ICMI-025-secret-faults.html
