A new blog post introduces Synthetic Persona Pretraining (SPP), which embeds desired values directly into pretraining data, and reports a 1.7 percent mean attack success rate on 1.7B-parameter models.
SPP Token Zero beat unfiltered, filtered, and SafeLM baselines across five benchmarks.
This is super cool work. Great to see open research on this topic!
New blog! Synthetic Persona Pretraining (SPP): Alignment from Token Zero

Current alignment is shallow - values bolted on after pretraining can be routed around. To solve this, we wrote the desired persona directly into pretraining data. Early results, but we're very excited. 🧵
Check out Julian and co's interesting blog post on using synthetic personas during pretraining for improved safety alignment.