They find that PG and top-k distillation are comparable so long as you use a same-origin teacher, and top-k is cheaper. But if you try to get cute with a different model as the teacher, even one from the same family, PG struggles and top-k totally collapses. I think this is why V4 went with full-vocabulary distillation.
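For intuition, here is a toy sketch of the three per-token signals being compared. Everything in it is an assumption for illustration, not the paper's setup: the function names are hypothetical, "top-k" is taken to mean the teacher's top-k tokens with both distributions renormalized over that set, "PG" is taken to mean the log-ratio on the single sampled token, and reverse KL is used as the full-vocabulary objective.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a plain list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def full_vocab_reverse_kl(student_logits, teacher_logits):
    # KL(student || teacher) summed over every token in the vocabulary.
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return sum(math.exp(s) * (s - t) for s, t in zip(ls, lt))

def top_k_reverse_kl(student_logits, teacher_logits, k):
    # Assumption: keep only the teacher's k highest-logit tokens and
    # renormalize both distributions over that subset before taking KL.
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    ls = log_softmax([student_logits[i] for i in top])
    lt = log_softmax([teacher_logits[i] for i in top])
    return sum(math.exp(s) * (s - t) for s, t in zip(ls, lt))

def sampled_token_log_ratio(student_logits, teacher_logits, token):
    # Assumption: PG-style signal is log p_student - log p_teacher
    # evaluated only at the token the student actually sampled.
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return ls[token] - lt[token]

# Toy case: the student puts most of its mass on token 2,
# which falls outside the teacher's top-2 set {0, 1}.
student = [0.0, 0.0, 5.0, 0.0]
teacher = [5.0, 4.0, 0.0, 0.0]

print("full vocab :", full_vocab_reverse_kl(student, teacher))
print("top-2      :", top_k_reverse_kl(student, teacher, k=2))
print("sampled (2):", sampled_token_log_ratio(student, teacher, token=2))
```

In this toy case the full-vocabulary value comes out much larger than the top-2 value, because the truncated version never sees the token where the student and teacher disagree most. That is only one way the signals can diverge, not a claim about what actually drives the collapse they report.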
Chinese labs are iterating on OPD quickly.