Elicitation here means optimizing the prompt (plus any scaffolding) to draw out the model's strongest performance on the controllability test. They found that structured instructions, clear chain-of-thought framing, and precise task specifications raised scores dramatically, so the eval reflects what the model can actually do rather than under-eliciting it.
What matters most in the prompt: explicit reasoning directives, unambiguous goals, and examples that guide transparent behavior. Vague or conflicting wording hides what the model can actually do.
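To make the idea of elicitation concrete, here is a minimal Python sketch of comparing prompt variants and reporting the best score. Everything in it is an assumption for illustration: the template texts, the `best_elicited_score` helper, and especially `toy_score`, which only counts a few cue phrases and stands in for actually running a model on the controllability test.

```python
from typing import Callable, Dict, Tuple


def best_elicited_score(
    templates: Dict[str, str],
    task: str,
    score_fn: Callable[[str], float],
) -> Tuple[str, float]:
    """Score every prompt template on the task and return the best (name, score)."""
    if not templates:
        raise ValueError("at least one prompt template is required")
    # Fill the task into each template, then score the resulting prompt.
    results = {
        name: score_fn(template.format(task=task))
        for name, template in templates.items()
    }
    best = max(results, key=results.get)
    return best, results[best]


# Hypothetical prompt variants: one vague, one with an explicit goal,
# a reasoning directive, and a clear output spec.
TEMPLATES = {
    "vague": "Do this: {task}",
    "structured": (
        "Goal: {task}\n"
        "Think step by step and show your reasoning.\n"
        "Then state the final answer."
    ),
}


def toy_score(prompt: str) -> float:
    """Stand-in scorer (assumption): fraction of cue phrases present in the prompt.

    A real evaluation would send the prompt to the model and grade its output.
    """
    cues = ("Goal:", "step by step", "final answer")
    return sum(cue in prompt for cue in cues) / len(cues)
```

Calling `best_elicited_score(TEMPLATES, "sort a list", toy_score)` picks the structured variant. The point of taking the maximum over variants is that the reported number then tracks the strongest prompt tried, not whichever one happened to be written first.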
Even well-optimized prompts stay brittle, so a guardrail that lives in a single prompt is unreliable for preventing sabotage or loss of control. Stronger approaches combine:
- Training for corrigibility and honesty (constitutional methods, targeted fine-tuning)
- Runtime monitors that inspect reasoning or outputs independently of the main prompt
- Hard constraints on actions/tools plus override signals the model is trained to respect
- Layered oversight instead of relying on wording alone
Together, these shift control away from fragile prompting and toward more deterministic, multi-layered enforcement.
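The runtime layers in the list above (hard tool constraints, independent monitors, and an override signal) can be sketched roughly as follows. This is an illustrative sketch only: `Action`, `ControlLayer`, and `keyword_monitor` are hypothetical names, the keyword check is a toy stand-in for a real monitor, and the training-based items (corrigibility, honesty) are not something code like this can capture.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Action:
    """A tool call proposed by the model, with its stated reasoning."""
    tool: str
    argument: str
    reasoning: str = ""


@dataclass
class ControlLayer:
    """Checks that run outside the prompt, in a fixed order."""
    allowed_tools: Set[str]
    monitors: List[Callable[[Action], bool]] = field(default_factory=list)
    override_active: bool = False  # set by an operator, never by the model

    def authorize(self, action: Action) -> str:
        # 1. An operator override stops everything, whatever the prompt says.
        if self.override_active:
            return "blocked: operator override"
        # 2. Hard constraint: only allowlisted tools may run.
        if action.tool not in self.allowed_tools:
            return "blocked: tool not allowed"
        # 3. Independent monitors inspect reasoning and arguments.
        for monitor in self.monitors:
            if not monitor(action):
                return "blocked: monitor flagged"
        return "allowed"


def keyword_monitor(action: Action) -> bool:
    """Toy monitor (assumption): returns False if a suspicious phrase appears."""
    flagged = ("disable logging", "hide this")
    text = f"{action.reasoning} {action.argument}".lower()
    return not any(phrase in text for phrase in flagged)
```

For example, with `ControlLayer(allowed_tools={"search", "read_file"}, monitors=[keyword_monitor])`, a `search` call is allowed, a `shell` call is rejected by the allowlist, and any call is rejected once `override_active` is set. Because each check is ordinary code outside the prompt, rewording the prompt cannot remove it.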