We can perform this conformalization process with limited data as predicting whether improvement is *possible* is often easier than knowing the *precise* language sequence that will elicit the improvement. [5/n]
However, this LFP can generalize poorly OOD. In response, we use techniques from conformal prediction to figure out when to "fall back" to the default user prompt, making sure steering is "mostly harmless." [4/n]
