Google DeepMind finds supervised fine-tuning, not reinforcement learning, is the primary driver of safety behaviors in Gemini models · Digg

The finding replicates a year-old study by Siva Reddy

Commentary on X

Reeve @reevefomo

@NeelNanda5 admitting u were wrong is rare, and the SFT part is genuinely interesting wonder how much of this holds for non-frontier models too

Mayz @lunan_ai

Strata @ChainZenit

@NeelNanda5 that is a super interesting realization, how did you find out?

Neel Nanda@NeelNanda5

Sentiment

Positive 90%, Negative 10%

Positive Read

Users praised the DeepMind research, which shows that SFT shapes Gemini safety behaviors more than RL does, as a genuinely interesting insight, and commended Neel Nanda for rare public intellectual honesty.

Based on 6 visible X reactions from 11 accounts; directional sample.

@NeelNanda5 love the intellectual honesty in admitting you were wrong publicly this part of the field needs more of that energy

Pode vir @thiagoTF

@NeelNanda5 @slimer48484 sft does matter huh. all that rl work just fixing sft mess.

Siva Reddy @sivareddyg

@NeelNanda5 Congratulations!! https://x.com/i/status/2066031475108020718

Neel Nanda @NeelNanda5

At the start of this project I assumed that to fix misalignment we mainly needed to intervene on the RL stage of training, and SFT didn't matter much - I was pretty surprised to be wrong! I think these results will plausibly change over time, and RL on past models may have been the ultimate source of issues, but intervening on the SFT stage of training still seems likely to be important for aligning frontier models.

Neel Nanda @NeelNanda5

At the start of this project I assumed RL was the source of basically all misalignment - I was pretty surprised to be wrong! I think these results will plausibly change over time, but SFT still seems likely to be important for aligning frontier models https://twitter.com/1471679067615342595/status/2065845470094606477

Siva Reddy @sivareddyg

Google DeepMind interpretability team rediscovered our year old work! SFT matters more for alignment than RLHF. https://x.com/sivareddyg/status/1985715581991936073 https://twitter.com/JoshAEngels/status/2065845470094606477

Arthur Conmy @ArthurConmy

Gemini 3.1 Pro and Gemini 3 Flash have most qualitative behaviors set by SFT, not RL, contrary to my expectations! https://twitter.com/JoshAEngels/status/2065845470094606477

Arthur Conmy @ArthurConmy

@sivareddyg Nice work! Sorry we were not aware of this. Yep it’s definitely related

Tuhin Chakrabarty @TuhinChakr

@sivareddyg @mariusmosbach Haha classic!!