Synthetic Document Finetuning Exposes Inductive Biases In AI Belief Formation

Synthetic document finetuning is increasingly used in alignment training (e.g. by Anthropic). It can: 1. Teach models facts about their constitution/values. 2. Illustrate sound reasoning that leads to aligned decisions. Model behavior is also influenced by natural documents in pretraining. So it's valuable to understand failure modes in how models form beliefs from documents, and when this deviates from in-context learning.

9:07 AM · May 15, 2026

Owain Evans @OwainEvans_UK

What causes Negation Neglect? We argue it reflects an inductive bias in models toward representing claims as true. Models can represent claims as false while still fitting the documents (when put under additional constraints), but such solutions are unstable under normal finetuning.

4:07 PM · May 15, 2026 · 11.9K Views

Paper: https://arxiv.org/abs/2605.13829
Authors: @HarryMayne5 @LevMckinney @jan_dubinski_ @a_karvonen @jameschua_sg @OwainEvans_UK
