Applied Compute introduced Relevance-Masked Self-Distillation (RMSD) to train models on enterprise tasks involving out-of-distribution behavior such as internal tools and customer data.
The method adds a two-step loss mask to filter noisy on-policy self-distillation signals.
Some of the coolest research we've done so far!
OOD learning without capability collapse feels like one of the central problems for making models actually useful in messy real-world settings.
We will continue pushing a lot on interesting post-training directions like this.
Some enterprise tasks are challenging to hill-climb with RL-based methods since they involve very out-of-distribution behavior. On-policy self-distillation (OPSD) gives a model learning signal for every token it writes, far richer than the single scalar reward of RL. But that channel is noisy: most tokens don't reflect the behavior you're after. We introduce Relevance-Masked Self-Distillation (RMSD), which uses a two-step filtered loss mask to cut through the noise and find the tokens with the highest signal. Compared to OPSD it trains more stably, provides higher data efficiency, and reaches a higher performance ceiling.
We've run into a new set of tasks that are challenging to train on because they involve internal tools and processes or customer preference data that can't be found on the internet. RL with a low success rate doesn't get you very far (after all, you can't really expect to learn if you keep failing and don't know what you don't know), so we've had to look for other algorithms.
The self-distillation literature has a nice solution involving a teacher and a student model. Think of the teacher as a "peer" of the student (it has the same model weights), but with access to a reference/solution manual that it uses to critique the student's work. We've found that this can lift the model out of a valley of failures on out-of-distribution tasks, and lead it to forget less compared to other methods that change the weights. Additionally, we introduce an extension (RMSD) that improves the data and compute efficiency of this method.
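To make the teacher-and-student setup above concrete, here is a minimal sketch of a masked per-token distillation loss in plain Python. It is an illustration under stated assumptions, not the method from the blog: the posts do not say what the two filtering steps are, so `two_step_mask` uses a hypothetical rule (drop tokens whose teacher-student divergence is below `min_kl`, then keep the top `keep_fraction` of the survivors). Using KL divergence as the per-token objective is also an assumption, and all function and parameter names are invented for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position (assumed objective)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def two_step_mask(kls, min_kl, keep_fraction):
    """Hypothetical two-step filter; the real RMSD steps are not given in the posts.

    Step 1: discard tokens whose divergence is below min_kl.
    Step 2: of the survivors, keep the keep_fraction with the largest divergence.
    """
    candidates = [i for i, k in enumerate(kls) if k >= min_kl]
    n_keep = math.ceil(len(candidates) * keep_fraction)
    ranked = sorted(candidates, key=lambda i: kls[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [1 if i in kept else 0 for i in range(len(kls))]

def masked_distill_loss(teacher_logits, student_logits, mask):
    """Average per-token KL over the tokens the mask keeps."""
    kls = [token_kl(t, s) for t, s in zip(teacher_logits, student_logits)]
    kept = sum(mask)
    if kept == 0:
        return 0.0  # nothing selected, so no learning signal for this sample
    return sum(k * m for k, m in zip(kls, mask)) / kept
```

In this sketch the teacher logits would come from the same weights as the student, run with the reference material in context, while the student logits come from the model's own rollout without it. Setting the mask to all ones corresponds to unmasked on-policy self-distillation, where every token contributes to the loss.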
Check out our blog (link in thread).