"Trained on only 15B tokens, HydraHead improves long-context performance by 69%+ on NIAH at 512K context, approaching the performance of Qwen3.5."
NIAH = Needle In A Haystack Basically the model finds a single out of place string in a 512k ctx dataset. 69% improvement 👀
HydraHead combines:
• an interpretability-driven strategy that preserves Full Attention only for retrieval-critical heads
• a scale-normalized fusion module that enables Full and Linear Attention to coexist within the same layer
Trained on only 15B tokens, HydraHead improves long-context performance by 69%+ on NIAH at 512K context, approaching the performance of Qwen3.5.